﻿WEBVTT

00:00:10.512 --> 00:00:15.376
- Good morning.
So, it's 12:03, so I want to get started.

00:00:15.376 --> 00:00:18.014
Welcome to Lecture 12 of CS231n.

00:00:18.014 --> 00:00:21.840
Today we are going to talk about Visualizing
and Understanding convolutional networks.

00:00:21.840 --> 00:00:25.270
This is always a super fun lecture to give
because we get to look at a lot of pretty pictures.

00:00:25.270 --> 00:00:28.375
So, it's, it's one of my favorites.

00:00:28.375 --> 00:00:30.354
As usual a couple administrative things.

00:00:30.354 --> 00:00:39.544
So, hopefully your projects are all going well, because as a reminder your milestones
are due on Canvas tonight. It is Canvas, right? Okay, I just wanted to double check, yeah.

00:00:39.545 --> 00:00:43.590
They're due on Canvas tonight, and we are
furiously working on grading your midterms.

00:00:43.590 --> 00:00:49.537
So we hope to have those midterm grades
back to you on Gradescope this week.

00:00:49.537 --> 00:00:54.987
So, I know there was a little confusion; you all got registration
emails for Gradescope probably in the last week.

00:00:54.988 --> 00:00:57.372
Something like that; we saw a
couple of questions on Piazza.

00:00:57.372 --> 00:00:59.530
So, we've decided to use
Gradescope to grade the midterms.

00:00:59.530 --> 00:01:02.973
So, don't be confused, if you
get some emails about that.

00:01:02.973 --> 00:01:05.047
Another reminder is that assignment three

00:01:05.047 --> 00:01:07.412
was released last week on Friday.

00:01:07.412 --> 00:01:11.088
It will be due a week from
this Friday, on the 26th.

00:01:11.088 --> 00:01:12.595
This is... assignment three

00:01:12.595 --> 00:01:14.444
is almost entirely brand new this year.

00:01:14.444 --> 00:01:17.152
So, we apologize for taking
a little bit longer than

00:01:17.152 --> 00:01:18.847
expected to get it out.

00:01:18.847 --> 00:01:20.272
But I think it's super cool.

00:01:20.272 --> 00:01:22.644
A lot of the stuff we'll
talk about in today's lecture,

00:01:22.644 --> 00:01:25.283
you'll actually be implementing
on your assignment.

00:01:25.283 --> 00:01:27.188
And for the assignment, you'll
get the choice of either

00:01:27.188 --> 00:01:29.575
PyTorch or TensorFlow

00:01:29.575 --> 00:01:30.921
to work through these different examples.

00:01:30.921 --> 00:01:34.512
So, we hope that's a really
useful experience for you guys.

00:01:34.512 --> 00:01:35.822
We also saw a lot of activity

00:01:35.822 --> 00:01:37.273
on HyperQuest over the weekend.

00:01:37.273 --> 00:01:39.084
So that's, that's really awesome.

00:01:39.084 --> 00:01:40.549
The leaderboard went up yesterday.

00:01:40.549 --> 00:01:42.568
It seems like you guys are
really trying to battle it out

00:01:42.568 --> 00:01:44.227
to show off your deep learning

00:01:44.227 --> 00:01:46.063
neural network training skills.

00:01:46.063 --> 00:01:47.402
So that's super cool.

00:01:47.402 --> 00:01:50.087
And because of the high interest

00:01:50.087 --> 00:01:52.811
in HyperQuest and the
conflict with

00:01:52.811 --> 00:01:55.118
the milestone submission time,

00:01:55.118 --> 00:01:56.808
we decided to extend the deadline

00:01:56.808 --> 00:01:58.591
for extra credit through Sunday.

00:01:58.591 --> 00:02:02.279
So, anyone who does at
least 12 runs on HyperQuest

00:02:02.279 --> 00:02:04.773
by Sunday will get a little bit
of extra credit in the class.

00:02:04.773 --> 00:02:07.394
Also, those of you who are
at the top of the leaderboard

00:02:07.394 --> 00:02:09.175
doing really well will
maybe get a little bit

00:02:09.175 --> 00:02:11.200
of extra extra credit.

00:02:11.200 --> 00:02:13.081
So, thanks for
participating; we got a lot of

00:02:13.081 --> 00:02:15.935
interest and that was really cool.

00:02:15.935 --> 00:02:17.844
Final reminder is about
the poster session.

00:02:17.844 --> 00:02:21.445
So, the poster
session will be on June 6th.

00:02:21.445 --> 00:02:22.872
That date is finalized,

00:02:22.872 --> 00:02:24.940
I think that, I don't
remember the exact time.

00:02:24.940 --> 00:02:25.932
But it is June 6th.

00:02:25.932 --> 00:02:27.141
We had some questions

00:02:27.141 --> 00:02:29.310
about when exactly that poster session is

00:02:29.310 --> 00:02:30.297
for those of you who are traveling

00:02:30.297 --> 00:02:31.897
at the end of the quarter
or starting internships

00:02:31.897 --> 00:02:33.247
or something like that.

00:02:33.247 --> 00:02:35.497
So, it will be June 6th.

00:02:35.497 --> 00:02:37.210
Any questions on the admin notes?

00:02:39.241 --> 00:02:41.171
No, totally clear.

00:02:41.171 --> 00:02:42.578
So, last time we talked...

00:02:42.578 --> 00:02:44.254
So, last time we had a pretty

00:02:44.254 --> 00:02:46.259
jam-packed lecture, where we
talked about a lot of different

00:02:46.259 --> 00:02:48.161
computer vision tasks, as a reminder.

00:02:48.161 --> 00:02:49.955
We talked about semantic segmentation

00:02:49.955 --> 00:02:52.035
which is this problem where
you want to assign labels

00:02:52.035 --> 00:02:54.318
to every pixel in the input image.

00:02:54.318 --> 00:02:56.131
But it doesn't differentiate between the

00:02:56.131 --> 00:02:58.225
object instances in those images.

00:02:58.225 --> 00:03:00.773
We talked about classification
plus localization.

00:03:00.773 --> 00:03:02.558
Where in addition to a class label

00:03:02.558 --> 00:03:04.059
you also want to draw a box

00:03:04.059 --> 00:03:06.539
or perhaps several boxes in the image.

00:03:06.539 --> 00:03:08.041
Where the distinction here is that,

00:03:08.041 --> 00:03:10.130
in a classification
plus localization setup.

00:03:10.130 --> 00:03:12.594
You have some fixed number of
objects that you are looking for.

00:03:12.594 --> 00:03:14.424
So, we also saw that this type of paradigm

00:03:14.424 --> 00:03:16.785
can be applied to things
like pose recognition,

00:03:16.785 --> 00:03:18.836
where you want to regress to
the positions of different joints

00:03:18.836 --> 00:03:20.222
in the human body.

00:03:20.222 --> 00:03:22.235
We also talked about object detection,

00:03:22.235 --> 00:03:23.976
where you start with some fixed

00:03:23.976 --> 00:03:25.851
set of category labels
that you are interested in.

00:03:25.851 --> 00:03:27.102
Like dogs and cats.

00:03:27.102 --> 00:03:29.460
And then the task is
to draw boxes around

00:03:29.460 --> 00:03:31.196
every instance of those objects

00:03:31.196 --> 00:03:32.769
that appear in the input image.

00:03:32.769 --> 00:03:35.303
And object detection
is really distinct from

00:03:35.303 --> 00:03:37.063
classification plus localization

00:03:37.063 --> 00:03:38.783
because with object
detection, we don't know

00:03:38.783 --> 00:03:40.629
ahead of time, how many object instances

00:03:40.629 --> 00:03:42.298
we're looking for in the image.

00:03:42.298 --> 00:03:44.272
And we saw that there's
this whole family of methods

00:03:44.272 --> 00:03:48.100
based on R-CNN, Fast R-CNN and Faster R-CNN,

00:03:48.100 --> 00:03:49.916
as well as the single
shot detection methods

00:03:49.916 --> 00:03:52.588
for addressing this problem
of object detection.

00:03:52.588 --> 00:03:55.026
Then finally we talked
pretty briefly about

00:03:55.026 --> 00:03:57.722
instance segmentation,
which is kind of combining

00:03:57.722 --> 00:04:01.164
aspects of semantic
segmentation and object detection

00:04:01.164 --> 00:04:03.308
where the goal is to
detect all the instances

00:04:03.308 --> 00:04:04.934
of the categories we care about,

00:04:04.934 --> 00:04:07.997
as well as label the pixels
belonging to each instance.

00:04:07.997 --> 00:04:11.339
So, in this case, we
detected two dogs and one cat

00:04:11.339 --> 00:04:13.093
and for each of those instances we wanted

00:04:13.093 --> 00:04:14.887
to label all the pixels.

00:04:14.887 --> 00:04:17.437
So we kind of
covered a lot last lecture,

00:04:17.437 --> 00:04:19.509
but those are really interesting
and exciting problems

00:04:19.509 --> 00:04:21.284
that you guys might consider

00:04:21.284 --> 00:04:23.810
using in parts of your projects.

00:04:23.810 --> 00:04:25.645
But today we are going to
shift gears a little bit

00:04:25.645 --> 00:04:27.081
and ask another question.

00:04:27.081 --> 00:04:28.702
Which is, what's really going on

00:04:28.702 --> 00:04:30.578
inside convolutional networks?

00:04:30.578 --> 00:04:32.445
We've seen by this point in the class

00:04:32.445 --> 00:04:34.120
how to train convolutional networks.

00:04:34.120 --> 00:04:35.916
How to stitch up different
types of architectures

00:04:35.916 --> 00:04:37.503
to attack different problems.

00:04:37.503 --> 00:04:39.860
But one question that you
might have had in your mind,

00:04:39.860 --> 00:04:42.653
is what exactly is going
on inside these networks?

00:04:42.653 --> 00:04:44.081
How did they do the things that they do?

00:04:44.081 --> 00:04:46.444
What kinds of features
are they looking for?

00:04:46.444 --> 00:04:48.612
And all these sorts of related questions.

00:04:48.612 --> 00:04:51.043
So, so far we've sort of seen

00:04:51.043 --> 00:04:53.399
ConvNets as a little bit of a black box.

00:04:53.399 --> 00:04:55.635
Where some input image of raw pixels

00:04:55.635 --> 00:04:57.100
is coming in on one side.

00:04:57.100 --> 00:04:58.816
It goes through many layers of convolution

00:04:58.816 --> 00:05:01.170
and pooling and different
sorts of transformations.

00:05:01.170 --> 00:05:04.547
And on the outside, we end up
with some set of class scores

00:05:04.547 --> 00:05:07.363
or some types of understandable
interpretable output.

00:05:07.363 --> 00:05:09.865
Such as class scores or
bounding box positions

00:05:09.865 --> 00:05:12.342
or labeled pixels or something like that.

00:05:12.342 --> 00:05:13.307
But the question is.

00:05:13.307 --> 00:05:15.933
What are all these other
layers in the middle doing?

00:05:15.933 --> 00:05:17.685
What kinds of things in the input image

00:05:17.685 --> 00:05:18.567
are they looking for?

00:05:18.567 --> 00:05:20.857
And can we try to gain intuition for

00:05:20.857 --> 00:05:22.023
how ConvNets are working?

00:05:22.023 --> 00:05:24.364
What types of things in the
image they are looking for?

00:05:24.364 --> 00:05:25.867
And what kinds of techniques do we have

00:05:25.867 --> 00:05:29.327
for analyzing the
internals of the network?

00:05:29.327 --> 00:05:32.667
So, one relatively simple
thing is the first layer.

00:05:32.667 --> 00:05:34.522
So, we've seen, we've
talked about this before.

00:05:34.522 --> 00:05:37.508
But recall that the
first convolutional layer

00:05:37.508 --> 00:05:39.819
consists of filters;

00:05:39.819 --> 00:05:41.492
so, for example, in AlexNet,

00:05:41.492 --> 00:05:43.262
The first convolutional layer consists

00:05:43.262 --> 00:05:45.193
of a number of convolutional filters.

00:05:45.193 --> 00:05:49.230
Each convolutional filter
has shape 3 by 11 by 11.

00:05:49.230 --> 00:05:51.228
And these convolutional filters get slid

00:05:51.228 --> 00:05:52.268
over the input image.

00:05:52.268 --> 00:05:54.947
We take inner products between
some chunk of the image.

00:05:54.947 --> 00:05:56.909
And the weights of the
convolutional filter.

00:05:56.909 --> 00:05:58.689
And that gives us our output

00:05:58.689 --> 00:06:01.729
after that first
convolutional layer.

00:06:01.729 --> 00:06:05.074
So, in AlexNet then we
have 64 of these filters.

00:06:05.074 --> 00:06:06.947
But now in the first layer
because we are taking

00:06:06.947 --> 00:06:08.780
in a direct inner product
between the weights

00:06:08.780 --> 00:06:10.175
of the convolutional layer

00:06:10.175 --> 00:06:11.682
and the pixels of the image,

00:06:11.682 --> 00:06:14.548
we can get some sense for what
these filters are looking for

00:06:14.548 --> 00:06:17.697
by simply visualizing the
learned weights of these filters

00:06:17.697 --> 00:06:19.458
as images themselves.

00:06:19.458 --> 00:06:22.576
So, for each of those
11 by 11 by 3 filters

00:06:22.576 --> 00:06:25.027
in AlexNet, we can just
visualize that filter

00:06:25.027 --> 00:06:28.461
as a little 11 by 11 image
with three channels

00:06:28.461 --> 00:06:30.201
giving the red, green and blue values.

00:06:30.201 --> 00:06:32.051
And then because there
are 64 of these filters

00:06:32.051 --> 00:06:35.305
we just visualize 64
little 11 by 11 images.

00:06:35.305 --> 00:06:38.047
And we can repeat this...
So what we've shown here:

00:06:38.047 --> 00:06:40.982
So, these are filters taken
from the pretrained models

00:06:40.982 --> 00:06:42.509
in the PyTorch model zoo.

00:06:42.509 --> 00:06:44.739
And we are looking at the
convolutional filters.

00:06:44.739 --> 00:06:45.985
The weights of the convolutional filters.

00:06:45.985 --> 00:06:48.313
at the first layer of AlexNet, ResNet-18,

00:06:48.313 --> 00:06:51.065
ResNet-101 and DenseNet-121.
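
A minimal sketch of this first-layer filter visualization, assuming a pretrained AlexNet from the torchvision model zoo (the rescaling and plotting details are my own, not from the lecture):

```python
# Sketch: visualize the 64 learned 3x11x11 filters of AlexNet's first conv layer.
import torch
import torchvision
import matplotlib.pyplot as plt

model = torchvision.models.alexnet(pretrained=True)
weights = model.features[0].weight.data.clone()     # shape: (64, 3, 11, 11)

# Rescale the weights into [0, 1] so each filter can be shown as an RGB image.
w_min, w_max = weights.min(), weights.max()
weights = (weights - w_min) / (w_max - w_min)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(weights[i].permute(1, 2, 0))           # (11, 11, 3) for imshow
    ax.axis('off')
plt.show()
```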

00:06:51.065 --> 00:06:53.753
And you can see kind
of what all these filters

00:06:53.753 --> 00:06:55.553
are looking for.

00:06:55.553 --> 00:06:59.015
You see a lot of things
looking for oriented edges,

00:06:59.015 --> 00:07:01.052
like bars of light and dark,

00:07:01.052 --> 00:07:04.487
at various angles
and various positions

00:07:04.487 --> 00:07:07.200
in the input. We can also see opposing colors,

00:07:07.200 --> 00:07:09.475
like this green and pink

00:07:09.475 --> 00:07:12.732
pair of opposing colors, or this orange
and blue pair.

00:07:12.732 --> 00:07:14.893
So, this, this kind of
connects back to what we

00:07:14.893 --> 00:07:16.221
talked about with Hubel and Wiesel,

00:07:16.221 --> 00:07:17.907
all the way back in the first lecture.

00:07:17.907 --> 00:07:19.716
Remember, the human visual system

00:07:19.716 --> 00:07:22.271
is known to detect
things like oriented edges.

00:07:22.271 --> 00:07:24.978
At the very early layers
of the human visual system.

00:07:24.978 --> 00:07:26.946
And it turns out that
these convolutional networks

00:07:26.946 --> 00:07:29.136
tend to do something, somewhat similar.

00:07:29.136 --> 00:07:31.566
At their first convolutional
layers as well.

00:07:31.566 --> 00:07:33.153
And what's kind of interesting is that

00:07:33.153 --> 00:07:35.631
pretty much no matter what type
of architecture you hook up

00:07:35.631 --> 00:07:37.920
or whatever type of training
data you train it on,

00:07:37.920 --> 00:07:40.594
you almost always get
the first layers...

00:07:40.594 --> 00:07:42.736
The first convolutional
weights of pretty much

00:07:42.736 --> 00:07:44.990
any convolutional network
looking at images

00:07:44.990 --> 00:07:46.389
end up looking something like this,

00:07:46.389 --> 00:07:48.676
with oriented edges and opposing colors

00:07:48.676 --> 00:07:51.539
looking at that input image.

00:07:51.539 --> 00:07:53.696
But this really only... sorry,
what was that question?

00:08:04.215 --> 00:08:06.118
Yes, these are showing the learned weights

00:08:06.118 --> 00:08:07.592
of the first convolutional layer.

00:08:15.766 --> 00:08:16.826
Oh, so the question is:

00:08:16.826 --> 00:08:18.998
why does visualizing the
weights of the filters

00:08:18.998 --> 00:08:21.318
tell you what the filter is looking for?

00:08:21.318 --> 00:08:23.945
So this intuition comes from
sort of template matching

00:08:23.945 --> 00:08:25.045
and inner products.

00:08:25.045 --> 00:08:28.389
If you imagine you have
some template vector,

00:08:28.389 --> 00:08:31.125
and then you imagine you
compute a scalar output

00:08:31.125 --> 00:08:33.272
by taking an inner product
between your template vector

00:08:33.272 --> 00:08:35.044
and some arbitrary piece of data.

00:08:35.044 --> 00:08:38.321
Then the input which
maximizes that activation,

00:08:38.321 --> 00:08:40.289
under a norm constraint on the input,

00:08:40.289 --> 00:08:43.062
is exactly when those
two vectors match up.

00:08:43.062 --> 00:08:45.564
So in that sense,
whenever you're taking

00:08:45.564 --> 00:08:48.066
inner products, the thing that
causes an inner product

00:08:48.066 --> 00:08:49.736
to excite maximally

00:08:49.736 --> 00:08:52.506
is a copy of the thing you are
taking an inner product with.

00:08:52.506 --> 00:08:55.060
So that's why we can
actually visualize these weights

00:08:55.060 --> 00:08:56.323
and why that shows us

00:08:56.323 --> 00:08:57.902
what this first layer is looking for.
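
In symbols, the template-matching fact being used here is just the following (a small worked note, not from the slides):

```latex
% For a fixed filter (template) w, the unit-norm input that maximizes the
% response w^T x is a rescaled copy of w itself (Cauchy-Schwarz):
\arg\max_{\|x\|_2 = 1} \; w^{\top} x \;=\; \frac{w}{\|w\|_2},
\qquad \text{with maximal response } \; w^{\top}\frac{w}{\|w\|_2} = \|w\|_2 .
```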

00:09:06.008 --> 00:09:08.731
So, for these networks
the first layer always

00:09:08.731 --> 00:09:10.052
was a convolutional layer.

00:09:10.052 --> 00:09:12.003
So, generally whenever
you are looking at images,

00:09:12.003 --> 00:09:13.808
whenever you are thinking about image data

00:09:13.808 --> 00:09:15.174
and training convolutional networks,

00:09:15.174 --> 00:09:16.525
you generally put a convolutional layer

00:09:16.525 --> 00:09:18.178
at the very first spot.

00:09:28.086 --> 00:09:29.006
Yeah, so the question is,

00:09:29.006 --> 00:09:30.665
can we do this same type of procedure

00:09:30.665 --> 00:09:32.118
in the middle of the network?

00:09:32.118 --> 00:09:33.202
That's actually the next slide.

00:09:33.202 --> 00:09:35.104
So, good anticipation.

00:09:35.104 --> 00:09:37.123
So, if we do, if we draw this exact same

00:09:37.123 --> 00:09:39.767
visualization for the
intermediate convolutional layers.

00:09:39.767 --> 00:09:41.753
It's actually a lot less interpretable.

00:09:41.753 --> 00:09:45.081
So, this is performing the
exact same visualization.

00:09:45.081 --> 00:09:49.278
So, remember, for this we're using
the tiny ConvNet demo network

00:09:49.278 --> 00:09:50.474
that's running on the course website

00:09:50.474 --> 00:09:51.890
whenever you go there.

00:09:51.890 --> 00:09:52.702
So, for that network,

00:09:52.702 --> 00:09:55.987
the first layer is a 7 by
7 convolution with 16 filters.

00:09:55.987 --> 00:09:58.263
So at the top we're visualizing
the first-layer weights

00:09:58.263 --> 00:10:00.842
for this network just like
we saw in a previous slide.

00:10:00.842 --> 00:10:02.366
But now look at the second-layer weights.

00:10:02.366 --> 00:10:04.491
After we do a convolution
then there's some ReLU

00:10:04.491 --> 00:10:06.583
or some other non-linearity perhaps.

00:10:06.583 --> 00:10:08.185
But the second convolutional layer,

00:10:08.185 --> 00:10:10.629
now receives this 16-channel input.

00:10:10.629 --> 00:10:15.116
and does a 7 by 7 convolution
with 20 convolutional filters.

00:10:15.116 --> 00:10:16.064
And we've actually,

00:10:16.064 --> 00:10:18.660
so the problem is that
you can't really visualize

00:10:18.660 --> 00:10:20.495
these directly as images.

00:10:20.495 --> 00:10:23.846
So, you can try... So here

00:10:23.846 --> 00:10:28.547
the input
has 16 dimensions in depth.

00:10:28.547 --> 00:10:30.286
And we have these convolutional filters,

00:10:30.286 --> 00:10:32.542
each convolutional filter is 7 by 7,

00:10:32.542 --> 00:10:34.388
and is extending along the full depth

00:10:34.388 --> 00:10:35.759
so has 16 elements.

00:10:35.759 --> 00:10:38.072
Then we have 20 of these
convolutional filters,

00:10:38.072 --> 00:10:40.924
that are producing the output
planes of the next layer.

00:10:40.924 --> 00:10:44.035
But the problem here is that

00:10:44.035 --> 00:10:45.128
looking directly at the weights

00:10:45.128 --> 00:10:47.498
of these filters, doesn't
really tell us much.

00:10:47.498 --> 00:10:49.734
So, what's really been done here is that,

00:10:49.734 --> 00:10:53.743
for this single 16 by 7
by 7 convolutional filter,

00:10:53.743 --> 00:10:58.192
we can spread out those sixteen 7
by 7 planes of the filter

00:10:58.192 --> 00:11:01.782
into sixteen 7 by 7 grayscale images.

00:11:01.782 --> 00:11:03.284
So, that's what we've done.

00:11:03.284 --> 00:11:07.095
Up here, these little
tiny grayscale images

00:11:07.095 --> 00:11:08.898
show us what the weights are

00:11:08.898 --> 00:11:11.852
in one of the convolutional
filters of the second layer.

00:11:11.852 --> 00:11:14.473
And now, because there are
20 outputs from this layer,

00:11:14.473 --> 00:11:17.534
this second convolutional
layer has 20 of these

00:11:17.534 --> 00:11:21.046
16 by 7 by 7 filters.

00:11:21.046 --> 00:11:22.871
So if we visualize the weights

00:11:22.871 --> 00:11:24.307
of those convolutional filters

00:11:24.307 --> 00:11:26.709
as images, you can see that there are some

00:11:26.709 --> 00:11:28.638
kind of spatial structure here.

00:11:28.638 --> 00:11:30.897
But it doesn't really
give you good intuition

00:11:30.897 --> 00:11:32.128
for what they are looking at.

00:11:32.128 --> 00:11:35.099
Because these filters are not connected

00:11:35.099 --> 00:11:36.644
directly to the input image.

00:11:36.644 --> 00:11:39.493
Instead recall that the second
layer convolutional filters

00:11:39.493 --> 00:11:41.851
are connected to the
output of the first layer.

00:11:41.851 --> 00:11:44.189
So, this is giving a visualization of

00:11:44.189 --> 00:11:46.684
what type of activation
pattern after the first

00:11:46.684 --> 00:11:49.331
convolution, would cause
the second-layer convolution

00:11:49.331 --> 00:11:50.646
to maximally activate.

00:11:50.646 --> 00:11:52.423
But, that's not very interpretable

00:11:52.423 --> 00:11:53.860
because we don't have a good sense

00:11:53.860 --> 00:11:55.966
for what those first-layer
convolutions look like

00:11:55.966 --> 00:11:58.490
in terms of image pixels.

00:11:58.490 --> 00:12:00.893
So we'll need to develop some
slightly more fancy technique

00:12:00.893 --> 00:12:02.047
to get a sense for what is going on

00:12:02.047 --> 00:12:03.556
in the intermediate layers.

00:12:03.556 --> 00:12:04.819
Question in the back.

00:12:09.189 --> 00:12:10.489
Yeah. So the question is that

00:12:10.489 --> 00:12:13.456
for all the visualizations
on the previous slide,

00:12:13.456 --> 00:12:16.552
we've had to scale the weights
to the zero-to-255 range.

00:12:16.552 --> 00:12:18.648
So in practice those
weights could be unbounded.

00:12:18.648 --> 00:12:19.885
They could have any range.

00:12:19.885 --> 00:12:22.983
But to get nice visualizations
we need to scale those.

00:12:22.983 --> 00:12:24.685
These visualizations also do not take

00:12:24.685 --> 00:12:26.409
into account the biases in these layers.

00:12:26.409 --> 00:12:28.162
So you should keep that in mind

00:12:28.162 --> 00:12:30.423
and not take these
visualizations

00:12:30.423 --> 00:12:31.892
too literally.

00:12:34.180 --> 00:12:35.237
Now at the last layer

00:12:35.237 --> 00:12:36.733
remember when we're looking at the last layer

00:12:36.733 --> 00:12:38.391
of a convolutional network,

00:12:38.391 --> 00:12:40.698
we have these maybe 1000 class scores

00:12:40.698 --> 00:12:42.891
that are telling us what
are the predicted scores

00:12:42.891 --> 00:12:44.908
for each of the classes
in our training data set

00:12:44.908 --> 00:12:46.676
and immediately before the last layer

00:12:46.676 --> 00:12:48.628
we often have some fully connected layer.

00:12:48.628 --> 00:12:49.962
In the case of AlexNet

00:12:49.962 --> 00:12:53.039
we have some 4096-dimensional
feature representation

00:12:53.039 --> 00:12:55.516
of our image that then
gets fed into

00:12:55.516 --> 00:12:58.328
our final layer to predict
our final class scores.

00:12:58.328 --> 00:13:00.606
And another kind of route

00:13:00.606 --> 00:13:02.787
for tackling the problem
of visualizing

00:13:02.787 --> 00:13:04.263
and understanding ConvNets

00:13:04.263 --> 00:13:06.520
is to try to understand what's
happening at the last layer

00:13:06.520 --> 00:13:07.967
of a convolutional network.

00:13:07.967 --> 00:13:09.022
So what we can do

00:13:09.022 --> 00:13:11.230
is to take some
data set of images,

00:13:11.230 --> 00:13:13.110
run a bunch of images

00:13:13.110 --> 00:13:14.815
through our trained convolutional network

00:13:14.815 --> 00:13:17.174
and record that 4096-dimensional vector

00:13:17.174 --> 00:13:18.687
for each of those images.

00:13:18.687 --> 00:13:20.722
And now go through and try to figure out

00:13:20.722 --> 00:13:23.219
and visualize that last
layer, that last hidden layer

00:13:23.219 --> 00:13:26.075
rather than
the first convolutional layer.

00:13:26.075 --> 00:13:27.804
So, one thing you might imagine is,

00:13:27.804 --> 00:13:29.791
is trying a nearest neighbor approach.

00:13:29.791 --> 00:13:31.559
So, remember, way back
in the second lecture

00:13:31.559 --> 00:13:33.162
we saw this graphic on the left

00:13:33.162 --> 00:13:36.045
where we, where we had a
nearest neighbor classifier.

00:13:36.045 --> 00:13:37.967
Where we were looking at
nearest neighbors in pixel

00:13:37.967 --> 00:13:40.303
space between CIFAR 10 images.

00:13:40.303 --> 00:13:41.996
And then when you look
at nearest neighbors

00:13:41.996 --> 00:13:44.765
in pixel space between CIFAR 10 images

00:13:44.765 --> 00:13:46.500
you see that you pull up images

00:13:46.500 --> 00:13:48.660
that look quite similar
to the query image.

00:13:48.660 --> 00:13:50.777
So again, the left column
here is some CIFAR-10 image

00:13:50.777 --> 00:13:52.350
from the CIFAR 10 data set

00:13:52.350 --> 00:13:54.987
and then these, these next five columns

00:13:54.987 --> 00:13:57.239
are showing the nearest
neighbors in pixel space

00:13:57.239 --> 00:13:58.917
to those test set images.

00:13:58.917 --> 00:14:00.185
And so for example

00:14:00.185 --> 00:14:02.446
this white dog that you see here,

00:14:02.446 --> 00:14:04.523
its nearest neighbors in pixel space

00:14:04.523 --> 00:14:06.328
are these kinds of white blobby things

00:14:06.328 --> 00:14:08.321
that may, may or may not be dogs,

00:14:08.321 --> 00:14:09.885
but at least the raw pixels

00:14:09.885 --> 00:14:11.643
of the image are quite similar.

00:14:11.643 --> 00:14:14.268
So now we can do the same
type of visualization

00:14:14.268 --> 00:14:16.937
computing and visualizing
these nearest neighbor images.

00:14:16.937 --> 00:14:17.963
But rather than computing

00:14:17.963 --> 00:14:19.952
the nearest neighbors in pixel space,

00:14:19.952 --> 00:14:21.735
instead we can compute nearest neighbors

00:14:21.735 --> 00:14:24.507
in that 4096-dimensional feature space,

00:14:24.507 --> 00:14:27.107
Which is computed by the
convolutional network.

00:14:27.107 --> 00:14:28.351
So here on the right

00:14:28.351 --> 00:14:29.987
we see some examples.

00:14:29.987 --> 00:14:32.069
So this, this first column shows us

00:14:32.069 --> 00:14:34.924
some examples of images from the test set

00:14:34.924 --> 00:14:38.338
from the ImageNet
classification data set,

00:14:38.338 --> 00:14:41.253
and now these
subsequent columns show us

00:14:41.253 --> 00:14:43.614
nearest neighbors to those test set images

00:14:43.614 --> 00:14:46.863
in the 4096-dimensional
feature space

00:14:46.863 --> 00:14:48.515
computed by AlexNet.

00:14:48.515 --> 00:14:51.010
And you can see here that
this is quite different

00:14:51.010 --> 00:14:52.941
from the pixel space nearest neighbors,

00:14:52.941 --> 00:14:55.086
because the pixels are
often quite different.

00:14:55.086 --> 00:14:57.111
between the image and
its nearest neighbors

00:14:57.111 --> 00:14:58.375
in feature space.

00:14:58.375 --> 00:15:03.031
However, the semantic content of those images
tends to be similar in this feature space.

00:15:03.031 --> 00:15:10.484
So for example, if you look at this second row, the query image is this
elephant standing on the left side of the image with green grass behind him,

00:15:10.484 --> 00:15:17.307
and now its third nearest neighbor in the
test set is actually an elephant standing on the right side of the image.

00:15:17.307 --> 00:15:26.942
So this is really interesting, because between this elephant standing on the left and this
elephant standing on the right, the pixels of those two images are almost entirely different.

00:15:26.942 --> 00:15:32.554
However, in the feature space which is learned by the network,
those two images end up being very close to each other.

00:15:32.554 --> 00:15:37.975
Which means that somehow this last-layer feature space is
capturing some of the semantic content of these images.

00:15:37.975 --> 00:15:46.192
That's really cool and really exciting, and in general looking at these kinds of nearest neighbor
visualizations is a really quick and easy way to visualize something about what's going on here.
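
A rough sketch of computing these feature-space nearest neighbors with a pretrained AlexNet; the truncation point of the classifier and the assumption that `database` and `queries` are already preprocessed image batches are mine, not from the lecture:

```python
# Sketch: nearest neighbors in AlexNet's 4096-d last hidden layer.
import torch
import torchvision

model = torchvision.models.alexnet(pretrained=True).eval()
# Truncate the classifier so the forward pass stops at the 4096-d hidden layer.
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])

@torch.no_grad()
def features(images):                     # images: (N, 3, 224, 224), normalized
    return model(images)                  # (N, 4096)

db_feats = features(database)             # features for the whole image set
q_feats = features(queries)               # features for the query images

# L2 distances between every query and every database image.
dists = torch.cdist(q_feats, db_feats)                   # (Q, N)
nearest = dists.topk(k=5, largest=False).indices          # 5 nearest neighbors per query
```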

00:16:02.617 --> 00:16:04.630
Yes. So the question is that

00:16:04.630 --> 00:16:13.942
through the standard supervised learning procedure for training a classification
network, there's nothing in the loss encouraging these features to be close together.

00:16:13.942 --> 00:16:21.476
So that's true. It's just kind of a happy accident that they end up being close to each
other, because we didn't tell the network during training that these features should be close.

00:16:21.476 --> 00:16:28.746
However, sometimes people do train networks using
things called either a contrastive loss or a triplet loss,

00:16:28.746 --> 00:16:37.253
which actually explicitly put constraints on the network such
that those last-layer features end up having some metric space interpretation.

00:16:37.253 --> 00:16:39.907
But AlexNet at least was not
trained specifically for that.

00:16:44.931 --> 00:16:46.060
The question is, what is the nearest...

00:16:46.060 --> 00:16:48.875
What does this nearest neighbor thing
have to do with the last layer?

00:16:48.875 --> 00:16:51.432
So we're taking this image
we're running it through the network

00:16:51.432 --> 00:16:57.670
and then the second-to-last layer, the last hidden
layer of the network, is this 4096-dimensional vector,

00:16:57.670 --> 00:17:01.797
because there are these fully connected
layers at the end of the network.

00:17:01.797 --> 00:17:06.893
So what we're doing is writing down that
4096-dimensional vector for each of the images,

00:17:06.894 --> 00:17:12.966
and then we are computing nearest neighbors according to that
4096-dimensional vector, which is computed by the network.

00:17:17.012 --> 00:17:19.171
Maybe, maybe we can chat offline.

00:17:19.171 --> 00:17:28.434
So another angle that we might have for visualizing what's going
on in this last layer is this concept of dimensionality reduction.

00:17:28.435 --> 00:17:33.220
So those of you who have taken CS229, for
example, have seen something like PCA,

00:17:33.220 --> 00:17:39.841
which lets you take some high-dimensional representation, like these
4096-dimensional features, and then compress it down to two dimensions.

00:17:39.841 --> 00:17:43.183
So then you can visualize that
feature space more directly.

00:17:43.183 --> 00:17:51.321
So, Principal Component Analysis or PCA is one way to do that,
but there's another really powerful algorithm called t-SNE,

00:17:51.321 --> 00:17:54.656
standing for t-distributed
stochastic neighbor embedding.

00:17:54.656 --> 00:18:03.137
It's a slightly more powerful, non-linear dimensionality reduction
method that people in deep learning often use for visualizing features.

00:18:03.137 --> 00:18:07.264
So here is just an
example of what t-SNE can do.

00:18:07.264 --> 00:18:13.231
This visualization here is showing a t-SNE
dimensionality reduction on the MNIST data set.

00:18:13.231 --> 00:18:17.521
So, MNIST, remember, is this data set of
handwritten digits between zero and nine.

00:18:17.521 --> 00:18:22.226
Each image is a 28 by 28 grayscale image,

00:18:22.226 --> 00:18:32.020
and now we've used t-SNE to take that 28 by 28 dimensional feature
space of the raw pixels for MNIST and compress it down to two dimensions,

00:18:32.020 --> 00:18:37.096
and then visualize each of those MNIST digits
in this compressed two-dimensional representation.

00:18:37.096 --> 00:18:42.653
And when you run t-SNE on the raw pixels of MNIST,
you can see these natural clusters appearing,

00:18:42.653 --> 00:18:47.532
which correspond to the
digits of the MNIST data set.

00:18:47.532 --> 00:18:57.348
So now we can do a similar type of visualization, where we apply this t-SNE dimensionality
reduction technique to the features from the last layer of our trained ImageNet classifier.

00:18:57.348 --> 00:19:05.073
So, to be a little bit more concrete, what we've done here is take
a large set of images and run them through our convolutional network.

00:19:05.073 --> 00:19:10.865
We record that final 4096-dimensional feature vector
from the last layer for each of those images,

00:19:10.865 --> 00:19:14.756
which gives us a large collection
of 4096-dimensional vectors.

00:19:14.756 --> 00:19:24.277
Now we apply t-SNE dimensionality reduction to sort of compress that
4096-dimensional feature space down into a two-dimensional feature space,

00:19:24.277 --> 00:19:36.415
and now we lay out a grid in that compressed two-dimensional feature space
and visualize what types of images appear at each location in the grid.

00:19:36.415 --> 00:19:43.417
So by doing this you get some rough sense of
what the geometry of this learned feature space looks like.

00:19:43.417 --> 00:19:48.620
So these images are a little bit hard to see, so I'd encourage
you to check out the high-resolution versions online.

00:19:48.620 --> 00:19:56.451
But at least on the left you can maybe see that there's sort of one cluster
at the bottom here of green things, which are different kinds of flowers,

00:19:56.451 --> 00:20:01.800
and there are other clusters for different types of
dog breeds and other types of animals and locations.

00:20:01.800 --> 00:20:06.192
So there's sort of this continuous
semantic structure in this feature space,

00:20:06.192 --> 00:20:11.597
which we can explore by looking at this t-SNE
dimensionality-reduced version of the features.
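
A rough sketch of this t-SNE grid visualization, assuming `db_feats` from the previous sketch and an array `thumbnails` of small versions of the same images (both assumptions of mine):

```python
# Sketch: t-SNE on the 4096-d features, then paste the nearest image into each grid cell.
import numpy as np
from sklearn.manifold import TSNE

feats = db_feats.numpy()                                   # (N, 4096)
coords = TSNE(n_components=2).fit_transform(feats)          # (N, 2)
coords = (coords - coords.min(0)) / (coords.max(0) - coords.min(0))

grid, thumb = 20, 32                                        # grid cells and thumbnail size
canvas = np.ones((grid * thumb, grid * thumb, 3))
for gx in range(grid):
    for gy in range(grid):
        cell = (np.array([gx, gy]) + 0.5) / grid
        idx = int(np.argmin(((coords - cell) ** 2).sum(1)))    # closest embedded image
        canvas[gy*thumb:(gy+1)*thumb, gx*thumb:(gx+1)*thumb] = thumbnails[idx]
# `canvas` can then be shown with matplotlib to get the grid-of-images figure.
```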

00:20:11.597 --> 00:20:12.604
Is there a question?

00:20:23.716 --> 00:20:29.793
Yeah. So the basic idea is that we have an image, and now we
end up with three different pieces of information about each image.

00:20:29.793 --> 00:20:31.308
We have the pixels of the image.

00:20:31.308 --> 00:20:33.353
We have the 4096-dimensional vector.

00:20:33.353 --> 00:20:38.109
Then we use t-SNE to convert the 4096-dimensional
vector into a two-dimensional coordinate,

00:20:38.109 --> 00:20:49.547
and then we take the original pixels of the image and place them at the two-dimensional coordinate
corresponding to the dimensionality-reduced version of the 4096-dimensional feature. Yeah, it's a little bit involved.

00:20:49.547 --> 00:20:50.348
Question in the front.

00:20:55.864 --> 00:20:59.255
The question is, roughly how much
variance do these two dimensions explain?

00:20:59.255 --> 00:21:06.080
Well, I'm not sure of the exact number, and it gets a little bit muddy when you're
talking about t-SNE, because it's a non-linear dimensionality reduction technique.

00:21:06.080 --> 00:21:10.259
So, I'd have to look offline and I'm not
sure of exactly how much it explains.

00:21:10.259 --> 00:21:14.377
Question?

00:21:14.377 --> 00:21:17.038
The question is, can you do the same analysis
for upper layers of the network?

00:21:17.038 --> 00:21:21.384
And yes, you can. But no, I don't have
those visualizations here. Sorry.

00:21:21.384 --> 00:21:24.603
Question?

00:21:35.559 --> 00:21:39.482
The question is, shouldn't we have overlaps of
images once we do this dimensionality reduction?

00:21:39.482 --> 00:21:40.902
And yes, of course, you would.

00:21:40.902 --> 00:21:47.537
So this is just kind of taking a nearest neighbor in our
regular grid and then picking an image close to each grid point.

00:21:47.537 --> 00:21:54.792
So, yeah, this is not showing you the kind
of density in different parts of the feature space.

00:21:54.792 --> 00:22:03.122
So that's another thing to look at, and again at the link there are a
couple more visualizations of this nature that address that a little bit.

00:22:03.122 --> 00:22:07.713
Okay. So another thing that you can
do for some of these intermediate features

00:22:07.713 --> 00:22:13.856
is... so we talked a couple of slides ago about how visualizing the
weights of these intermediate layers is not so interpretable,

00:22:13.856 --> 00:22:20.846
but actually visualizing the activation maps of those
intermediate layers is kind of interpretable in some cases.

00:22:20.846 --> 00:22:28.603
So again, take the example of AlexNet.
Remember the conv5 layer of AlexNet.

00:22:28.603 --> 00:22:35.668
The conv5 features for any image
are a 128 by 13 by 13 dimensional tensor.

00:22:35.668 --> 00:22:42.386
But we can think of that as 128
different 13 by 13 two-dimensional grids.

00:22:42.386 --> 00:22:49.741
So now we can actually go and visualize each of those 13 by
13 element slices of the feature map as a grayscale image,

00:22:49.741 --> 00:22:58.501
and this gives us some sense of what types of things in the input
each of those features in that convolutional layer is looking for.
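
A small sketch of grabbing such an activation volume with a forward hook; the choice of `features[10]` as the last conv layer of torchvision's AlexNet and the preprocessed `image` tensor are assumptions, not from the lecture:

```python
# Sketch: capture an intermediate activation volume and show each channel
# as a grayscale activation map.
import torch
import torchvision
import matplotlib.pyplot as plt

model = torchvision.models.alexnet(pretrained=True).eval()
activations = {}

def save_activation(module, inputs, output):
    activations['feat'] = output.detach()

model.features[10].register_forward_hook(save_activation)   # last conv layer

with torch.no_grad():
    model(image)                       # `image`: preprocessed (1, 3, 224, 224) tensor

feat = activations['feat'][0]          # (C, H, W) activation volume
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(feat[i], cmap='gray')    # one small activation map per channel
    ax.axis('off')
plt.show()
```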

00:22:58.501 --> 00:23:03.306
So this is a really cool interactive tool
by Jason Yosinski that you can just download.

00:23:03.306 --> 00:23:06.598
So I don't have the video here,
but there's a video on his website.

00:23:06.598 --> 00:23:10.059
But it's running a convolutional network
on the input stream of a webcam

00:23:10.059 --> 00:23:17.279
and then visualizing in real time each of those slices of that
intermediate feature map to give you a sense of what it's looking for.

00:23:17.279 --> 00:23:23.931
And you can see that here the input image is
this picture of a person in front of the camera,

00:23:23.931 --> 00:23:28.192
and most of these intermediate features
are kind of noisy, not much going on.

00:23:28.192 --> 00:23:34.277
But there's this one highlighted
intermediate feature, which is also shown larger here,

00:23:34.277 --> 00:23:41.103
that seems to be activating on the portions of the feature map
corresponding to the person's face. Which is really interesting,

00:23:41.103 --> 00:23:51.045
and that kind of suggests that maybe this particular slice of the feature map of this
layer of this particular network is looking for human faces or something like that.

00:23:51.045 --> 00:23:54.132
Which is kind of a nice
and cool finding.

00:23:54.132 --> 00:23:55.517
Question?

00:23:59.038 --> 00:24:04.957
The question is, are the black activations dead ReLUs?
So you've got to be a little careful with terminology.

00:24:04.957 --> 00:24:09.539
We usually say dead ReLU to mean something
that's dead over the entire training data set.

00:24:09.539 --> 00:24:14.701
Here I would say that it's a ReLU that's
not active for this particular input.

00:24:14.701 --> 00:24:15.702
Question?

00:24:19.457 --> 00:24:22.538
The question is, if there are no humans in
ImageNet, how can it recognize a human face?

00:24:22.538 --> 00:24:24.182
There definitely are humans in ImageNet.

00:24:24.182 --> 00:24:29.020
I don't think it's one of the thousand
categories for the classification challenge,

00:24:29.020 --> 00:24:34.906
but people definitely appear in a lot of these images, and that
can be a useful signal for detecting other types of things.

00:24:34.906 --> 00:24:41.617
So that's actually kind of a nice result, because it shows that the network
can learn features that are useful for the classification task at hand,

00:24:41.617 --> 00:24:47.483
even ones that are maybe a little bit different from the explicit classification
task that we told it to perform. So it's actually a really cool result.

00:24:50.346 --> 00:24:51.929
Okay, question?

00:24:55.192 --> 00:25:03.334
So in the convolutional network our input image is, like, 3 by 224 by 224,
and then it goes through many stages of convolution.

00:25:03.334 --> 00:25:07.731
And then after each convolutional layer
there is some three-dimensional chunk of numbers,

00:25:07.731 --> 00:25:10.476
which are the outputs from that layer
of the convolutional network.

00:25:10.476 --> 00:25:18.155
And the entire three-dimensional chunk of numbers which is the
output of a convolutional layer, we call an activation volume,

00:25:18.155 --> 00:25:22.156
and then one of those slices
is an activation map.

00:25:34.426 --> 00:25:38.513
So the question is, If the image is K by K
will the activation map be K by K?

00:25:38.513 --> 00:25:42.489
Not always, because there can be subsampling
due to strided convolution and pooling.

00:25:42.489 --> 00:25:47.756
But in general, the, the size of each activation
map will be linear in the size of the input image.

00:25:50.492 --> 00:25:55.625
So another kind of useful thing we can
do for visualizing intermediate features is

00:25:55.625 --> 00:26:03.453
visualizing what types of patches from input images cause maximal
activation in different features, different neurons.

00:26:03.453 --> 00:26:08.605
So what we've done here is pick...
maybe again the conv5 layer from AlexNet.

00:26:08.605 --> 00:26:10.926
And remember each of
these activation volumes

00:26:10.926 --> 00:26:15.738
at conv5 in AlexNet gives
us a 128 by 13 by 13 chunk of numbers.

00:26:15.738 --> 00:26:19.644
Then we'll pick one of those 128 channels,
maybe channel 17,

00:26:19.644 --> 00:26:23.749
and now what we'll do is run many images
through this convolutional network.

00:26:23.749 --> 00:26:27.456
And then for each of those images
record the conv5 features,

00:26:27.456 --> 00:26:37.925
and then look at the parts of that 17th feature map that
are maximally activated over our data set of images.

00:26:37.925 --> 00:26:45.161
And now, because again this is a convolutional layer each of those neurons
in the convolutional layer has some small receptive field in the input.

00:26:45.161 --> 00:26:49.239
Each of those neurons is not looking at the whole image;
they're only looking at a subset of the image.

00:26:49.239 --> 00:27:00.731
Then what we'll do is visualize the patches from this large data set of images
corresponding to the maximal activations of that particular feature in that particular layer.

00:27:00.731 --> 00:27:06.177
And then we can sort these patches by
their activation at that particular layer.
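
A very rough sketch of that maximally-activating-patches procedure, reusing the `model` and `activations` hook from the earlier sketch; the channel index, receptive-field size, and stride used to map back to pixels are illustrative assumptions:

```python
# Sketch: find maximally activating patches for one channel of conv5 over a dataset.
import torch

channel, patch_size, top_k = 17, 100, 10     # patch_size ~ receptive field (assumed)
scores = []                                  # (activation, image_idx, row, col)

with torch.no_grad():
    for idx, (image, _) in enumerate(dataset):          # any (image, label) dataset
        model(image.unsqueeze(0))                        # fills activations['feat']
        fmap = activations['feat'][0, channel]           # (13, 13) map for this channel
        val, flat = fmap.max(), fmap.argmax()
        scores.append((val.item(), idx, int(flat) // 13, int(flat) % 13))

scores.sort(reverse=True)
for val, idx, r, c in scores[:top_k]:
    image, _ = dataset[idx]
    # Map (r, c) back to approximate input coordinates (stride ~16 for conv5
    # of AlexNet is an assumption) and crop the patch around that position.
    cy, cx = r * 16, c * 16
    patch = image[:, cy:cy + patch_size, cx:cx + patch_size]
```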

00:27:06.177 --> 00:27:12.575
So here are some examples from this
network... the network doesn't matter.

00:27:12.575 --> 00:27:16.380
But these are some visualizations of these
kind of maximally activating patches.

00:27:16.380 --> 00:27:22.500
So for each row we've chosen one
neuron from one layer of a network,

00:27:22.500 --> 00:27:28.280
and then these are the sorted
patches from some large data set of images

00:27:28.280 --> 00:27:30.611
that maximally activated this one neuron.

00:27:30.611 --> 00:27:35.698
And these can give you a sense for what type of
features these, these neurons might be looking for.

00:27:35.698 --> 00:27:39.998
So for example, in this top row we see a lot
of circly kinds of things in the image:

00:27:39.998 --> 00:27:44.621
some eyes, mostly eyes,
but also this kind of blue circly region.

00:27:44.621 --> 00:27:51.303
So then maybe this particular neuron in this particular layer of
this network is looking for kind of blue circly things in the input.

00:27:51.303 --> 00:27:56.200
Or maybe in the middle here we have neurons
that are looking for text in different colors

00:27:56.200 --> 00:28:02.201
or maybe curving edges
of different colors and orientations.

00:28:06.246 --> 00:28:09.199
Yeah, so, I've been a little bit loose
with terminology here.

00:28:09.199 --> 00:28:13.970
So, I'm saying that a neuron is one scalar
value in that conv5 activation map.

00:28:13.970 --> 00:28:19.283
But because it's convolutional, all the neurons
in one channel are all using the same weights.

00:28:19.283 --> 00:28:26.451
So we've chosen one channel, and, right, you get a lot
of neurons for each convolutional filter at any one layer.

00:28:26.451 --> 00:28:32.532
So these patches could have been drawn from
anywhere in the image due to the convolutional nature of the thing.

00:28:32.532 --> 00:28:38.721
And now at the bottom we also see some maximally activating
patches for neurons from a higher up layer in the same network.

00:28:38.721 --> 00:28:42.294
And now because they are coming from higher in
the network they have a larger receptive field.

00:28:42.294 --> 00:28:44.851
So, they're looking at larger
patches of the input image

00:28:44.851 --> 00:28:49.213
and we can also see that they're looking for
maybe larger structures in the input image.

00:28:49.213 --> 00:28:56.445
So this second row seems to be
looking for humans, or maybe human faces.

00:28:56.445 --> 00:29:06.410
We have maybe something looking for parts of cameras,
or different types of larger, object-like types of things.

00:29:06.410 --> 00:29:11.885
Another cool experiment we can do, which
comes from Zeiler and Fergus's ECCV 2014 paper,

00:29:11.885 --> 00:29:14.062
is this idea of an occlusion experiment.

00:29:14.062 --> 00:29:21.659
So, what we want to do is figure out which parts of the input
image cause the network to make its classification decision.

00:29:21.659 --> 00:29:25.339
So, what we'll do is take our
input image, in this case an elephant,

00:29:25.339 --> 00:29:32.486
and then we'll block out some region in that input
image and just replace it with the mean pixel value from the data set.

00:29:32.486 --> 00:29:39.583
And now run that occluded image through the network and
record the predicted probability for this occluded image.

00:29:39.583 --> 00:29:44.752
And now slide this occluding patch over every position
in the input image and repeat the same process,

00:29:44.752 --> 00:29:53.699
and then draw a heat map showing the predicted probability output from
the network as a function of which part of the input image we occluded.

00:29:53.699 --> 00:29:59.952
And the idea is that if blocking out some part of the
image causes the network score to change drastically,

00:29:59.952 --> 00:30:04.809
then probably that part of the input image was
really important for the classification decision.
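
A sketch of this occlusion experiment; the patch size, stride, and fill value here are assumptions rather than the exact settings from the paper:

```python
# Sketch of the occlusion experiment: slide a gray square over the image and
# record the predicted probability of the true class at each position.
import torch
import torch.nn.functional as F

def occlusion_map(model, image, target_class, patch=32, stride=16, fill=0.0):
    # image: preprocessed (3, H, W) tensor; fill ~ dataset mean (0 after normalization)
    _, H, W = image.shape
    heatmap = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
    model.eval()
    with torch.no_grad():
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = fill
                probs = F.softmax(model(occluded.unsqueeze(0)), dim=1)
                heatmap[i, j] = probs[0, target_class]
    return heatmap   # low values mark regions the prediction depends on
```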

00:30:04.809 --> 00:30:11.420
So here I've shown three different
examples of this occlusion-type experiment.

00:30:11.420 --> 00:30:14.456
So, take this example of
a go-kart at the bottom.

00:30:14.456 --> 00:30:23.077
You can see over here that red corresponds to a low
probability and white and yellow correspond to a high probability.

00:30:23.077 --> 00:30:30.348
So when we block out the region of the image corresponding to this go-kart
in front, the predicted probability for the go-kart class drops a lot.

00:30:30.348 --> 00:30:38.419
So that gives us some sense that the network actually cares a lot about these
pixels in the input image in order to make its classification decision.

00:30:38.419 --> 00:30:39.589
Question?

00:30:47.473 --> 00:30:49.780
Yes, the question is that,
what's going on in the background?

00:30:49.780 --> 00:30:56.020
So maybe the image is a little bit too small to tell, but this is actually a
go-kart track and there are a couple of other go-karts in the background.

00:30:56.020 --> 00:31:00.395
So I think that when you're blocking out these other
go-karts in the background, that's also influencing the score,

00:31:00.395 --> 00:31:04.628
or maybe the horizon is there and the
horizon is a useful feature for detecting go-karts;

00:31:04.628 --> 00:31:08.976
it's a little bit hard to tell sometimes.
But this is a pretty cool visualization.

00:31:08.976 --> 00:31:10.118
Yeah, was there another question?

00:31:20.486 --> 00:31:23.500
So the question is, sorry,
sorry, what was the first question?

00:31:30.731 --> 00:31:36.802
So the question... For this example we're taking one
image and then masking out all parts of that one image.

00:31:36.802 --> 00:31:38.777
The second question
was, how is this useful?

00:31:38.777 --> 00:31:42.982
It's not that you take this information
and then loop it directly back into the training process.

00:31:42.982 --> 00:31:49.341
Instead, this is a tool for humans to understand
what types of computations these trained networks are doing.

00:31:49.341 --> 00:31:54.296
So it's more for your understanding
than for improving performance per se.

00:31:54.296 --> 00:31:57.890
So another related idea
is this concept of a Saliency Map,

00:31:57.890 --> 00:32:00.534
Which is something that you
will see in your homeworks.

00:32:00.534 --> 00:32:02.578
So again, we have the same question

00:32:02.578 --> 00:32:07.831
of given an input image of a dog in this
case and the predicted class label of dog

00:32:07.831 --> 00:32:11.796
we want to know which pixels in the input
image are important for classification.

00:32:11.796 --> 00:32:19.452
We saw that masking is one way to get at this question, but
Saliency Maps are another angle for attacking this problem.

00:32:19.452 --> 00:32:25.354
And one relatively simple idea, from
Karen Simonyan's paper a couple of years ago,

00:32:25.354 --> 00:32:31.694
is just computing the gradient of the predicted
class score with respect to the pixels of the input image.

00:32:31.694 --> 00:32:36.042
And this will directly tell us in this
sort of, first order approximation sense.

00:32:36.042 --> 00:32:43.963
For each pixel in the input image, if we wiggle that pixel a
little bit, how much will the classification score for the class change?

00:32:43.963 --> 00:32:50.496
And this is another way to get at this question of which
pixels in the input matter for the classification.
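
A minimal sketch of computing such a saliency map in PyTorch (the function and variable names are my own):

```python
# Sketch: a vanilla saliency map, i.e. the gradient of the correct-class score
# with respect to the input pixels.
import torch

def saliency_map(model, image, target_class):
    model.eval()
    x = image.unsqueeze(0).clone().requires_grad_(True)   # (1, 3, H, W)
    scores = model(x)                                      # unnormalized class scores
    scores[0, target_class].backward()                     # d(score)/d(pixels)
    # Take the absolute value and the max over the 3 color channels.
    return x.grad[0].abs().max(dim=0).values               # (H, W) saliency map
```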

00:32:50.496 --> 00:32:59.356
And when we compute, for example, a Saliency Map for this
dog, we see kind of a nice outline of the dog in the image,

00:32:59.356 --> 00:33:04.985
which tells us that these are probably the pixels
that the network is actually looking at for this image.

00:33:04.985 --> 00:33:11.675
And when we repeat this type of process for different images, we get
some sense that the network is sort of looking at the right regions.

00:33:11.675 --> 00:33:13.360
Which is somewhat comforting.

00:33:13.360 --> 00:33:14.462
Question?

00:33:17.407 --> 00:33:21.916
The question is, do people use Saliency Maps
for semantic segmentation? The answer is yes.

00:33:21.916 --> 00:33:26.741
That actually was... Yeah, you guys are
like really on top of it this lecture.

00:33:26.741 --> 00:33:29.513
So that was another component,
again in Karen's paper.

00:33:29.513 --> 00:33:38.925
There's this idea that maybe you can use these Saliency Maps to perform
semantic segmentation without any labeled data for these segments.

00:33:38.925 --> 00:33:43.908
So here they're using this Grabcut Segmentation Algorithm
which I don't really want to get into the details of.

00:33:43.908 --> 00:33:47.772
But it's kind of an interactive
segmentation algorithm that you can use.

00:33:47.772 --> 00:33:55.697
So then when you combine this Saliency Map with this Grabcut Segmentation
Algorithm then you can in fact, sometimes segment out the object in the image.

00:33:55.697 --> 00:34:00.326
Which is really cool. However I'd like to
point out that this is a little bit brittle

00:34:00.326 --> 00:34:07.182
and in general this will probably work much, much worse than a
network which did have access to supervision at training time.

00:34:07.182 --> 00:34:13.458
So I'm not sure how practical this
is, but it is pretty cool that it works at all.

00:34:13.458 --> 00:34:19.025
But it probably works much less well than something
trained explicitly to segment with supervision.

00:34:19.025 --> 00:34:23.791
So another related idea is
this idea of guided backpropagation.

00:34:23.791 --> 00:34:30.001
So again, we still want to answer
a question for one particular image,

00:34:30.001 --> 00:34:37.420
but now, instead of looking at the class score, we want to
pick some intermediate neuron in the network and ask again,

00:34:37.420 --> 00:34:44.199
which parts of the input image influence the score
of that neuron, that internal neuron in the network.

00:34:44.199 --> 00:34:49.059
And then you could imagine computing
a Saliency Map for this again, right?

00:34:49.059 --> 00:34:53.466
Rather than computing the gradient of the class
scores with respect to the pixels of the image,

00:34:53.466 --> 00:34:58.815
you could compute the gradient of some intermediate value
in the network with respect to the pixels of the image.

00:34:58.815 --> 00:35:05.832
And that would tell us again which pixels in the
input image influence the value of that particular neuron.

00:35:05.832 --> 00:35:08.342
And that would be using
normal back propagation.

00:35:08.342 --> 00:35:15.093
But it turns out that there is a slight tweak that we can do to this back
propagation procedure that ends up giving some slightly cleaner images.

00:35:15.093 --> 00:35:21.393
So that's this idea of guided back propagation that
again comes from Zeiler and Fergus's 2014 paper.

00:35:21.393 --> 00:35:24.203
And I don't really want to get
into the details too much here

00:35:24.203 --> 00:35:30.220
but, it, you just, it's kind of weird tweak where you change
the way that you back propagate through relu non-linearities.

00:35:30.220 --> 00:35:37.254
And you sort of, only, only back propagate positive gradients through
relu's and you do not back propagate negative gradients through the relu's.

00:35:37.254 --> 00:35:46.948
So you're no longer computing the true gradient; instead you're kind of only
keeping track of positive influences throughout the entire network.
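
For concreteness, here is a minimal sketch of that ReLU tweak in PyTorch, assuming the model's ReLUs are ordinary (non-in-place) nn.ReLU modules; it is only meant to illustrate the idea of dropping negative gradients on the backward pass:

    import torch.nn as nn

    # Guided-backprop sketch: keep only positive gradients at each ReLU,
    # in addition to ReLU's usual forward gating.
    def apply_guided_relu(model):
        def hook(module, grad_input, grad_output):
            # grad_input already has the forward ReLU gating applied;
            # additionally zero out the negative gradients.
            return (grad_input[0].clamp(min=0),)
        for m in model.modules():
            if isinstance(m, nn.ReLU):
                m.register_full_backward_hook(hook)
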

00:35:46.948 --> 00:35:53.614
So maybe you should read through these referenced papers
if you want a little bit more detail about why that's a good idea.

00:35:53.614 --> 00:36:01.649
But empirically, when you do guided back propagation as opposed to
regular back propagation, you tend to get much cleaner, nicer images

00:36:01.649 --> 00:36:07.223
that tell you which pixels of the
input image influence that particular neuron.

00:36:07.223 --> 00:36:12.467
So, again we were seeing the same visualization we saw
a few slides ago of the maximally activating patches.

00:36:16.488 --> 00:36:20.174
But now, in addition to visualizing
these maximally activating patches.

00:36:20.174 --> 00:36:27.604
We've also performed guided back propagation, to tell us exactly
which parts of these patches influence the score of that neuron.

00:36:27.604 --> 00:36:37.139
So, remember for this example at the top, we thought this neuron may be looking
for circle-like things in the input patch, because there are a lot of circular patches.

00:36:37.139 --> 00:36:42.028
Well, when we look at guided back propagation, we
can see that intuition is somewhat confirmed

00:36:42.028 --> 00:36:49.218
because it is indeed the circular parts of that input
patch which are influencing that neuron's value.

00:36:49.218 --> 00:36:56.514
So, this is kind of a useful tool for understanding
what these different intermediate neurons are looking for.

00:36:56.514 --> 00:37:05.108
But, one kind of interesting thing about guided back propagation or computing
saliency maps is that they're always a function of a fixed input image,

00:37:05.108 --> 00:37:12.882
right, they're telling us for a fixed input image, which pixel or
which parts of that input image influence the value of the neuron.

00:37:12.882 --> 00:37:19.110
Another question you might ask is: can we remove
this reliance on some particular input image?

00:37:19.110 --> 00:37:24.641
And then instead just ask what type of input
in general would cause this neuron to activate

00:37:24.641 --> 00:37:29.118
and we can answer this question
using a technique called gradient ascent

00:37:29.118 --> 00:37:34.903
so, remember we always use gradient descent to train
our convolutional networks by minimizing the loss.

00:37:34.903 --> 00:37:40.552
Instead now, we want to fix the
weights of our trained convolutional network

00:37:40.552 --> 00:37:50.932
and instead synthesize an image by performing gradient ascent on the pixels of the
image to try and maximize the score of some intermediate neuron or of some class.

00:37:50.932 --> 00:37:58.333
So, in this process of gradient ascent, we're no longer optimizing
over the weights of the network; those weights remain fixed,

00:37:58.333 --> 00:38:07.104
instead we're trying to change the pixels of some input image to cause this
neuron value, or this class score, to be maximized.

00:38:07.104 --> 00:38:10.475
But, in addition,
we need some regularization term.

00:38:10.475 --> 00:38:19.078
So, remember we've seen regularization terms before that try
to prevent the network weights from overfitting to the training data.

00:38:19.078 --> 00:38:27.109
Now, we need something kind of similar to prevent the pixels of our generated
image from overfitting to the peculiarities of that particular network.

00:38:27.109 --> 00:38:34.664
So, here we'll often incorporate some regularization term;
we want the generated image to have two properties:

00:38:34.664 --> 00:38:39.269
one, we want it to maximally activate
some score or some neuron value.

00:38:39.269 --> 00:38:42.111
But, we also want it to
look like a natural image.

00:38:42.111 --> 00:38:46.485
We want it to have the kind of statistics
that we typically see in natural images.

00:38:46.485 --> 00:38:52.936
So, this regularization term in the objective is something
to force the generated image to look relatively natural.

00:38:52.936 --> 00:38:57.116
And we'll see a couple of different
examples of regularizers as we go through.

00:38:57.116 --> 00:39:04.371
But, the general strategy for this is actually pretty simple, and again
you'll implement a lot of things of this nature on your assignment three.

00:39:04.371 --> 00:39:10.410
But, what we'll do is start with some initial image,
either initialized to zeros or to uniform random noise.

00:39:10.410 --> 00:39:19.922
But, initialize your image in some way, and then repeat: forward your image
through the network and compute the score or neuron value that you're interested in.

00:39:19.922 --> 00:39:26.643
Now, back propagate to compute the Gradient of that
neuron score with respect to the pixels of the image

00:39:26.643 --> 00:39:33.897
and then make a small gradient ascent update to the pixels
of the image itself, to try and maximize that score.

00:39:33.897 --> 00:39:38.786
And I'll repeat this process over and over
again, until you have a beautiful image.

00:39:38.786 --> 00:39:42.311
And, then we talked, we talked
about the image regularizer,

00:39:42.311 --> 00:39:49.428
well, a very simple idea for an image regularizer
is simply to penalize the L2 norm of the generated image.

00:39:49.428 --> 00:39:51.466
This is not so semantically meaningful,

00:39:51.466 --> 00:40:01.764
it just does something, and this was one of the earliest regularizers
that we've seen in the literature for these image-generation types of papers.
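
Putting the loop and the L2 penalty together, a minimal gradient-ascent sketch might look like the following, where model and target_class are assumed to already exist and the step count and regularization strength are arbitrary choices:

    import torch

    # Class-visualization sketch: gradient ascent on the pixels with an L2 penalty.
    img = torch.zeros(1, 3, 224, 224, requires_grad=True)
    optimizer = torch.optim.SGD([img], lr=1.0)
    l2_weight = 1e-3  # hypothetical regularization strength

    for step in range(200):
        optimizer.zero_grad()
        score = model(img)[0, target_class]
        loss = -score + l2_weight * img.pow(2).sum()  # ascend the score = descend its negative
        loss.backward()
        optimizer.step()
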

00:40:01.764 --> 00:40:12.153
And, when you run this on a trained network you can see that now we're trying to generate
images that maximize the dumbbell score, in the upper left hand corner here for example.

00:40:12.153 --> 00:40:14.820
And, then you can see that
the synthesized image,

00:40:14.820 --> 00:40:19.726
it's a little bit hard to see maybe, but
there are a lot of different dumbbell-like shapes,

00:40:19.726 --> 00:40:23.162
all kind of superimposed
at different portions of the image.

00:40:23.162 --> 00:40:29.111
Or if we try to generate an image for cups, then we can maybe
see a bunch of different cups all kind of superimposed.

00:40:29.111 --> 00:40:30.466
the Dalmatian is pretty cool,

00:40:30.466 --> 00:40:35.478
because now we can see kind of this black and white spotted
pattern that's kind of characteristics of Dalmatians

00:40:35.478 --> 00:40:40.388
or for lemons we can see these different
kinds of yellow splotches in the image.

00:40:40.388 --> 00:40:43.539
And there's a couple of more examples here,
I think may be the goose is kind of cool

00:40:43.539 --> 00:40:46.514
or the kit fox actually
maybe looks like a kit fox.

00:40:46.514 --> 00:40:47.454
Question?

00:40:55.528 --> 00:40:57.929
The question is, why are
these all rainbow colored

00:40:57.929 --> 00:41:02.434
and in general getting true colors out
of this visualization is pretty tricky.

00:41:02.434 --> 00:41:06.693
Right, because any, any actual image will
be bounded in the range zero to 255.

00:41:06.693 --> 00:41:10.395
So, it really should be some kind
of constrained optimization problem

00:41:10.395 --> 00:41:15.721
But, if we're using these generic methods for gradient
ascent, then that's going to be an unconstrained problem.

00:41:15.721 --> 00:41:21.848
So, maybe you use a projected gradient ascent
algorithm, or you rescale the image at the end.

00:41:21.848 --> 00:41:27.799
So, the colors that you see in these visualizations,
sometimes you cannot take them too seriously.

00:41:27.799 --> 00:41:28.702
Question?

00:41:32.801 --> 00:41:36.846
The question is what happens, if you let the
thing loose and don't put any regularizer on it.

00:41:36.846 --> 00:41:44.860
Well, then you tend to get an image which maximizes the score,
and which is confidently classified as the class you wanted,

00:41:44.860 --> 00:41:48.522
but, usually it doesn't look like anything.
It kind of looks like random noise.

00:41:48.522 --> 00:41:54.538
So, that's kind of an interesting property in itself
that will go into much more detail in a future lecture.

00:41:54.538 --> 00:42:00.913
But, that's why, that kind of doesn't help you so much for
understanding what things the network is looking for.

00:42:00.913 --> 00:42:09.607
So, if we want to understand why the network makes its decisions, then it's kind of
useful to put a regularizer on there so that the generated image looks more natural.

00:42:09.607 --> 00:42:10.471
A question in the back.

00:42:34.416 --> 00:42:38.492
Yeah, so the question is that we see a lot of
multimodality here, and are there ways to combat that.

00:42:38.492 --> 00:42:44.847
And actually yes, we'll see that, this is kind of first step
in the whole line of work in improving these visualizations.

00:42:44.847 --> 00:42:51.517
So, the angle here is to improve the
regularizer in order to improve our visualized images.

00:42:51.517 --> 00:42:58.621
And there's another paper from Jason Yosinski and some of his
collaborators where they added some additional regularizers.

00:42:58.621 --> 00:43:00.924
So, in addition to this
L2 norm constraint,

00:43:00.924 --> 00:43:06.213
in addition, we also periodically during optimization
do some Gaussian blurring on the image,

00:43:06.213 --> 00:43:12.441
we also clip some small pixel
values all the way to zero, and we also clip some of the

00:43:12.441 --> 00:43:14.694
pixel values
with low gradients to zero.

00:43:14.694 --> 00:43:17.559
So, you can see this is kind of
a projected gradient ascent algorithm

00:43:17.559 --> 00:43:24.555
where periodically we're projecting our generated image
onto some nicer set of images with nicer properties.

00:43:24.555 --> 00:43:28.241
For example, spatial smoothness
via the Gaussian blurring.

00:43:28.241 --> 00:43:32.870
So, when you do this, you tend to get much
nicer images that are much clear to see.

00:43:32.870 --> 00:43:38.553
So, now these flamingos look like flamingos the
ground beetle is starting to look more beetle like

00:43:38.553 --> 00:43:41.695
or this black swan maybe
looks like a black swan.

00:43:41.695 --> 00:43:48.211
These billiard tables actually look kind of impressive now,
where you can definitely see this billiard table structure.

00:43:48.211 --> 00:43:55.209
So, you can see that once you add in nicer regularizers, then
the generated images become a bit, a little bit cleaner.

00:43:55.209 --> 00:44:01.038
And, now we can perform this procedure not only for the final
class scores, but also for these intermediate neurons as well.

00:44:01.038 --> 00:44:10.111
So, instead of trying to maximize our billiard table score, for example,
we can instead maximize one of the neurons from some intermediate layer.

00:44:10.111 --> 00:44:11.118
Question.

00:44:16.743 --> 00:44:19.393
So, the question is what's with
the four examples here;

00:44:19.393 --> 00:44:21.794
so remember we're
initializing our image randomly,

00:44:21.794 --> 00:44:25.681
so, these four images would be different
random initializations of the input image.

00:44:28.106 --> 00:44:36.113
And again, we can use this same type of procedure to synthesize
images which maximally activate intermediate neurons of the network.

00:44:36.113 --> 00:44:40.174
And, then you can get a sense of what some of
these intermediate neurons are looking for,

00:44:40.174 --> 00:44:44.605
so maybe at layer four there's a neuron
that's kind of looking for spirally things,

00:44:44.605 --> 00:44:49.703
or there's a neuron that's maybe looking for like chunks
of caterpillars; it's a little bit harder to tell.

00:44:49.703 --> 00:44:56.585
But, in general as you go further up in the network, you can see that,
obviously, the receptive fields of these neurons are larger.

00:44:56.585 --> 00:44:58.664
So, they're looking at
larger patches in the image.

00:44:58.664 --> 00:45:03.549
And they tend to be looking for may be larger
structures or more complex patterns in the input image.

00:45:03.549 --> 00:45:04.802
That's pretty cool.

00:45:07.499 --> 00:45:15.559
And, then people have really gone crazy with this, basically
improving these visualizations by tacking on extra features.

00:45:15.559 --> 00:45:23.697
So, this was a cool paper explicitly trying to address this
multimodality that someone asked a question about a few minutes ago.

00:45:23.697 --> 00:45:29.849
So, here they were trying to explicitly take this
multimodality into account in the optimization procedure,

00:45:29.849 --> 00:45:35.254
where, for each of the classes,
you run a clustering algorithm

00:45:35.254 --> 00:45:42.667
to try to separate the classes into different modes and then
initialize with something that is close to one of those modes.

00:45:42.667 --> 00:45:45.890
And, then when you do that, you kind
of account for this multi modality.

00:45:45.890 --> 00:45:51.675
so for intuition, on the right here these
eight images are all of grocery stores.

00:45:51.675 --> 00:45:56.401
But, the top row, is kind of close
up pictures of produce on the shelf

00:45:56.401 --> 00:45:59.068
and those are labeled as grocery stores

00:45:59.068 --> 00:46:04.221
And the bottom row kind of shows people walking around grocery
stores or at the checkout line or something like that.

00:46:04.221 --> 00:46:06.085
And, those are also labeled
those as grocery store,

00:46:06.085 --> 00:46:08.073
but their visual appearance
is quite different.

00:46:08.073 --> 00:46:10.988
So, a lot of these classes end
up being sort of multimodal.

00:46:10.988 --> 00:46:17.648
And, if you explicitly take this multimodality into account
when generating images, then you can get nicer results.

00:46:17.648 --> 00:46:22.569
And now, when you look at some of their
example synthesized images for classes,

00:46:22.569 --> 00:46:31.840
you can see, like the bell pepper, the cardoon, strawberries, the jack-o'-lantern,
they end up with some very beautifully generated images.

00:46:31.840 --> 00:46:38.177
And now, I don't want to get too much into the detail of
the next slide. But, then you can even go crazier

00:46:38.177 --> 00:46:43.623
and add an even stronger image prior and
generate some very beautiful images indeed

00:46:43.623 --> 00:46:48.921
So, these are all synthesized images that are trying
to maximize the class score of some ImageNet class.

00:46:48.921 --> 00:46:59.020
But, the general idea is that rather than optimizing directly the pixels of the input
image, instead they're trying to optimize the FC6 representation of that image instead.

00:46:59.020 --> 00:47:03.342
And now they need to use some feature inversion network
and I don't want to get into the details here.

00:47:03.342 --> 00:47:05.290
You should read the paper,
it's actually really cool

00:47:05.290 --> 00:47:11.905
But, the point is that, when you start adding
additional priors towards modeling natural images

00:47:11.905 --> 00:47:16.662
you can end up generating some quite realistic images that
give you some sense of what the network is looking for.

00:47:18.951 --> 00:47:23.839
So, that's, that's sort of one cool thing that
we can do with this strategy, but this idea

00:47:23.839 --> 00:47:29.893
of trying to synthesis images by using Gradients
on image pixels, is actually super powerful.

00:47:29.893 --> 00:47:34.288
And, another really cool thing we can do
with this is this concept of fooling images.

00:47:34.288 --> 00:47:43.362
So, what we can do is pick some arbitrary image, say we
take a picture of an elephant, and then we tell the network

00:47:43.362 --> 00:47:49.418
that we want to, change the image to
maximize the score of Koala bear instead

00:47:49.418 --> 00:47:57.064
So, then what we're doing is trying to change that image of an elephant
to instead cause the network to classify it as a koala bear.

00:47:57.064 --> 00:48:05.931
And, what you might hope for is that maybe that elephant would sort of morph
into a koala bear, and maybe it would sprout cute little ears or something like that.

00:48:05.931 --> 00:48:09.241
But, that's not what happens in practice,
which is pretty surprising.

00:48:09.241 --> 00:48:17.377
Instead, if you take this picture of an elephant and try to change the
elephant image to cause it to be classified as a koala bear,

00:48:17.377 --> 00:48:24.853
what you'll find is that this second image on the right
actually is classified as a koala bear, but it looks the same to us.

00:48:24.853 --> 00:48:28.016
So that's pretty fishy
and pretty surprising.
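
A minimal sketch of that fooling-image procedure might look like the following, where model, elephant_img, and koala_class are hypothetical names standing in for your own network, source image, and target class:

    import torch

    # Fooling-image sketch: nudge the pixels of a real image toward a different class.
    fooling = elephant_img.detach().clone().requires_grad_(True)
    for step in range(100):
        score = model(fooling)[0, koala_class]
        score.backward()
        with torch.no_grad():
            fooling += fooling.grad / (fooling.grad.norm() + 1e-8)  # small normalized ascent step
            fooling.grad.zero_()
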

00:48:28.016 --> 00:48:34.114
So also on the bottom we've taken this picture
of a boat; schooner is the ImageNet class,

00:48:34.114 --> 00:48:37.170
and then we told the network
to classify it as an iPod.

00:48:37.170 --> 00:48:41.881
So now the second example looks just, still looks
like a boat to us but the network thinks it's an iPod

00:48:41.881 --> 00:48:46.260
and the differences in pixels between
these two images are basically nothing.

00:48:46.260 --> 00:48:52.025
And if you magnify those differences you don't really see
any iPod or Koala like features on these differences,

00:48:52.025 --> 00:48:58.924
they're just kind of like random patterns of noise. So the question
is, what's going on here? And how can this possibly be the case?

00:48:58.924 --> 00:49:03.635
Well, we'll have a guest lecture from Ian
Goodfellow in a week and a half or two weeks.

00:49:03.635 --> 00:49:08.068
And he's going to go in much more detail about this
type of phenomenon and that will be really exciting.

00:49:08.068 --> 00:49:11.006
But I did want to mention it here
because it is on your homework.

00:49:11.006 --> 00:49:11.595
Question?

00:49:16.320 --> 00:49:20.050
Yeah, so that's something, so the question
is can we use fooled images as training data

00:49:20.050 --> 00:49:27.214
and I think, Ian's going to go in much more detail on all of these types of
strategies. Because that's literally, that's really a whole lecture onto itself.

00:49:27.214 --> 00:49:28.885
Question?

00:50:00.608 --> 00:50:03.478
The question is why do we
care about any of this stuff?

00:50:03.478 --> 00:50:08.685
Basically... Okay, maybe that was a
mischaracterization, I am sorry.

00:50:24.573 --> 00:50:32.027
Yeah, the question is, how does understanding these intermediate
neurons help our understanding of the final classification.

00:50:32.027 --> 00:50:38.921
So this is actually, this whole field of trying to visualize intermediates
is kind of in response to a common criticism of deep learning.

00:50:38.921 --> 00:50:43.011
So a common criticism of deep learning is
like, you've got this big black box network

00:50:43.011 --> 00:50:47.350
you trained it with gradient descent, you get a good
number and that's great, but we don't trust the network

00:50:47.350 --> 00:50:51.272
because we don't understand as people why it's
making the decisions that it's making.

00:50:51.272 --> 00:51:01.530
So a lot of these types of visualization techniques were developed to try to address that, and to try to
understand, as people, why the networks are making their various classification decisions a bit more.

00:51:01.530 --> 00:51:07.721
Because if you contrast a deep convolutional
neural network with other machine learning techniques,

00:51:07.722 --> 00:51:10.493
Like linear models are much
easier to interpret in general

00:51:10.493 --> 00:51:17.457
because you can look at the weights and kind of understand
how much each input feature affects the decision. Or if you look at something like

00:51:17.458 --> 00:51:19.459
a random forest or decision tree.

00:51:19.459 --> 00:51:27.442
some other machine learning models end up being a bit more interpretable just
by their very nature than these sort of black box convolutional networks.

00:51:27.442 --> 00:51:33.520
So a lot of this is sort of in response to that criticism
to say that, yes they are these large complex models

00:51:33.520 --> 00:51:37.263
but they are still doing some interesting
and interpretable things under the hood.

00:51:37.263 --> 00:51:42.201
They're not just going out and randomly
classifying things; they're doing something meaningful.

00:51:44.891 --> 00:51:50.989
So another cool thing we can do with this gradient based
optimization of images is this idea of DeepDream.

00:51:50.989 --> 00:51:55.592
So this was a really cool blog post that
came out from Google a year or two ago.

00:51:55.592 --> 00:52:00.859
And the idea is that, this is, so we talked about
scientific value, this is almost entirely for fun.

00:52:00.859 --> 00:52:04.284
So the point of this exercise is mostly
to generate cool images.

00:52:04.284 --> 00:52:10.186
And as an aside, you also get some sense for what
features these networks are looking at in images.

00:52:10.186 --> 00:52:15.275
So what we can do is take our input image and run it
through the convolutional network up to some layer,

00:52:15.275 --> 00:52:17.035
and now we back propagate

00:52:17.035 --> 00:52:20.742
and set the gradient of that, at that
layer equal to the activation value.

00:52:20.742 --> 00:52:25.427
And now back propagate, back to the image and
update the image and repeat, repeat, repeat.

00:52:25.427 --> 00:52:31.682
So this has the interpretation of trying to amplify existing
features that were detected by the network in this image. Right?

00:52:31.682 --> 00:52:35.875
Because whatever features existed on that layer
now we set the gradient equal to the feature

00:52:35.875 --> 00:52:40.010
and we just tell the network to amplify whatever
features you already saw in that image.

00:52:40.010 --> 00:52:46.918
And by the way you can also see this as trying to maximize
the L2 norm of the features at that layer of the image.

00:52:46.918 --> 00:52:55.999
And it turns... And when you do this the code ends up looking really simple. So your code for many of
your homework assignments will probably be about this complex or maybe even a little bit a less so.

00:52:55.999 --> 00:53:00.785
So the idea is that... But there's a couple of tricks
here that you'll also see in your assignments.

00:53:00.785 --> 00:53:04.443
So one trick is to jitter the image
before you compute your gradients.

00:53:04.443 --> 00:53:11.187
So rather than running the exact image through the network instead you'll shift
the image over by two pixels and kind of wrap the other two pixels over here.

00:53:11.187 --> 00:53:19.540
And this is a kind of regularizer; it regularizes
a little bit to encourage a little bit of extra spatial smoothness in the image.

00:53:19.540 --> 00:53:26.653
You also see they use L1 normalization of the gradients that's kind of
a useful trick sometimes when doing this image generation problems.

00:53:26.653 --> 00:53:33.843
You also see them clipping the pixel values once in a while. So again,
we talked about how images actually should be between zero and 255,

00:53:33.843 --> 00:53:39.335
so this is a kind of projected gradient descent where
we project onto the space of actual valid images.
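
A rough sketch of that DeepDream-style loop, including the jitter, gradient normalization, and clipping tricks, might look like this, where submodel (the network truncated at the chosen layer) and img are assumed to exist and the clipping range assumes a normalized image:

    import torch

    # DeepDream-style sketch: amplify whatever features submodel already detects in img.
    for step in range(100):
        ox, oy = int(torch.randint(-2, 3, (1,))), int(torch.randint(-2, 3, (1,)))
        img = torch.roll(img, shifts=(ox, oy), dims=(2, 3))    # jitter before the forward pass
        img = img.detach().requires_grad_(True)
        feats = submodel(img)
        feats.pow(2).sum().backward()        # same effect as sending the activations back as the gradient
        with torch.no_grad():
            g = img.grad / (img.grad.abs().mean() + 1e-8)      # normalize the gradient
            img = (img + 0.01 * g).clamp(-1.5, 1.5)            # ascent step, then clip (assumed normalized range)
        img = torch.roll(img, shifts=(-ox, -oy), dims=(2, 3))  # undo the jitter
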

00:53:39.335 --> 00:53:46.215
But now when we do all this then we start, we might start with
some image of a sky and then we get really cool results like this.

00:53:46.215 --> 00:53:52.614
So you can see that now we've taken these tiny features on the
sky and they get amplified through this, through this process.

00:53:52.614 --> 00:53:59.007
And we can see things like this different mutant animals
start to pop up or these kind of spiral shapes pop up.

00:53:59.007 --> 00:54:04.296
Different kinds of houses and cars pop up. So
that's all, that's all pretty interesting.

00:54:04.296 --> 00:54:08.743
There's a couple patterns in particular that
pop up all the time that people have named.

00:54:08.743 --> 00:54:12.133
Right, so there's this admiral
dog, that shows up a lot.

00:54:12.133 --> 00:54:16.033
There's the pig snail, the camel bird,
and the dog fish.

00:54:16.033 --> 00:54:22.771
Right, so these are kind of interesting, but actually the fact that
dogs show up so much in these visualizations actually does tell us

00:54:22.771 --> 00:54:26.249
something about the data on
which this network was trained.

00:54:26.249 --> 00:54:30.786
Right, because this is a network that was trained for ImageNet
classification, and ImageNet has a thousand categories.

00:54:30.786 --> 00:54:32.915
But 200 of those categories are dogs.

00:54:32.915 --> 00:54:44.027
So it's kind of not surprising in a sense that when you do these kinds of visualizations, the network
ends up hallucinating a lot of dog-like stuff in the image, often morphed with other types of animals.

00:54:44.027 --> 00:54:47.327
When you do this at other layers of the
network, you get other types of results.

00:54:47.327 --> 00:54:52.708
So here we're taking one of these lower layers in the network,
the previous example was relatively high up in the network

00:54:52.708 --> 00:54:57.791
and now again we have this interpretation that lower layers
maybe computing edges and swirls and stuff like that

00:54:57.791 --> 00:55:01.766
and that's kind of borne out when we
run DeepDream at a lower layer.

00:55:01.766 --> 00:55:08.346
Or if you run this thing for a long time and maybe add in some
multiscale processing you can get some really, really crazy images.

00:55:08.346 --> 00:55:14.631
Right, so here they're doing a kind of multiscale processing where they start
with a small image run DeepDream on the small image then make it bigger

00:55:14.631 --> 00:55:19.893
and continue DeepDream on the larger image and kind of
repeat with this multiscale processing and then you can get,

00:55:19.893 --> 00:55:25.699
and then maybe after you complete the final scale then you
restart from the beginning and you just go wild on this thing.

00:55:25.699 --> 00:55:28.126
And you can get some really crazy images.

00:55:28.126 --> 00:55:31.454
So these examples were all from networks
trained on image net

00:55:31.454 --> 00:55:35.216
There's another data set from
MIT called MIT Places Data set

00:55:35.216 --> 00:55:40.224
but instead of 1,000 categories of objects
instead it's 200 different types of scenes

00:55:40.224 --> 00:55:42.663
like bedrooms and kitchens
like stuff like that.

00:55:42.663 --> 00:55:50.868
And now if we repeat this DeepDream procedure using a network trained
on MIT Places, we get some really cool visualizations as well.

00:55:50.868 --> 00:55:59.491
So now instead of dogs, slugs, admiral dogs and that kind of stuff, instead
we often get these kinds of roof shapes of these kind of Japanese-style buildings,

00:55:59.491 --> 00:56:02.104
or these different types of
bridges or mountain ranges.

00:56:02.104 --> 00:56:05.288
They're like really, really
cool beautiful visualizations.

00:56:05.288 --> 00:56:11.685
So the code for DeepDream is online, released by Google you
can go check it out and make your own beautiful pictures

00:56:11.685 --> 00:56:14.535
So there's another kind of...
Sorry question?

00:56:24.731 --> 00:56:28.252
So the question is, what are
we taking the gradient of?

00:56:28.252 --> 00:56:33.318
So like I say, because for one half
x squared, the gradient of that is x.

00:56:33.318 --> 00:56:44.477
So, if you send back the volume of activations as the gradient, that's
equivalent to taking the gradient of one half the sum of the squared values.

00:56:44.477 --> 00:56:49.665
So it's equivalent to maximizing the norm
of the features at that layer.

00:56:49.665 --> 00:56:56.511
But in practice, many implementations you'll see don't
explicitly compute that; instead they just send the gradient back.

00:56:56.511 --> 00:57:01.478
So another kind of useful, another kind of useful
thing we can do is this concept of feature inversion.

00:57:01.478 --> 00:57:07.687
So this again gives us a sense for what types of, what types of
elements of the image are captured at different layers of the network.

00:57:07.687 --> 00:57:12.220
So what we're going to do now is we're going to
take an image, run that image through network

00:57:12.220 --> 00:57:15.832
record the feature values
at one of the layers,

00:57:15.832 --> 00:57:20.283
and now we're going to try to reconstruct
that image from its feature representation.

00:57:20.283 --> 00:57:31.074
And now, based on what that reconstructed image looks like, that'll give us some sense
for what type of information about the image was captured in that feature vector.

00:57:31.074 --> 00:57:34.191
So again, we can do this with gradient
ascent with some regularizer.

00:57:34.191 --> 00:57:41.709
Where now, rather than maximizing some score, instead we want
to minimize the distance between this cached feature vector

00:57:41.709 --> 00:57:50.014
and the computed features of our generated image, to try and again
synthesize a new image that matches the feature vector that we computed before.

00:57:50.014 --> 00:57:56.856
And another kind of regularizer that you frequently see here is the
total variation regularizer that you also see on your homework.

00:57:56.856 --> 00:58:05.954
So here the total variation regularizer is penalizing differences between
adjacent pixels, both left-right and top-bottom,

00:58:05.954 --> 00:58:09.956
to again try to encourage spatial
smoothness in the generated image.
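
A minimal feature-inversion sketch with a total variation term might look like the following, assuming submodel returns the features of the chosen layer and target_feats is the cached feature map we recorded for the original image:

    import torch

    # Feature-inversion sketch: recover an image whose features match target_feats.
    def tv_loss(x):
        # total variation: penalize differences between vertically and horizontally adjacent pixels
        return (x[:, :, 1:, :] - x[:, :, :-1, :]).pow(2).sum() + \
               (x[:, :, :, 1:] - x[:, :, :, :-1]).pow(2).sum()

    x = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=0.05)
    for step in range(500):
        opt.zero_grad()
        loss = (submodel(x) - target_feats).pow(2).sum() + 1e-4 * tv_loss(x)
        loss.backward()
        opt.step()
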

00:58:09.956 --> 00:58:16.369
So now if we do this idea of feature inversion, in this
visualization here on the left we're showing some original images,

00:58:16.369 --> 00:58:18.294
the elephants or the fruits on the left.

00:58:18.294 --> 00:58:22.458
And then we run that,
we run the image through a VGG-16 network.

00:58:22.458 --> 00:58:30.013
Record the features of that network at some layer and then try to
synthesize a new image that matches the recorded features of that layer.

00:58:30.013 --> 00:58:37.534
And this kind of gives us a sense for how much information
about the image is stored in the features at different layers.

00:58:37.534 --> 00:58:43.849
So for example, if we try to reconstruct the image based
on the relu2_2 features from VGG-16,

00:58:43.849 --> 00:58:46.628
We see that the image gets
almost perfectly reconstructed.

00:58:46.628 --> 00:58:52.664
Which means that we're not really throwing away much
information about the raw pixel values at that layer.

00:58:52.664 --> 00:58:58.593
But as we move up into the deeper parts of the network
and try to reconstruct from relu4_3, relu5_1.

00:58:58.593 --> 00:59:05.488
We see that our reconstructed image now, we've kind of kept the
general space, the general spatial structure of the image.

00:59:05.488 --> 00:59:09.684
You can still tell that, that it's a
elephant or a banana or a, or an apple.

00:59:09.684 --> 00:59:16.427
But a lot of the low-level details, like exactly what the pixel values
were, exactly what the colors were, exactly what the textures were,

00:59:16.427 --> 00:59:20.923
these kinds of low-level details are kind of
lost at these higher layers of the network.

00:59:20.923 --> 00:59:29.153
So that gives us some sense that maybe as we move up through the layers of the network,
it's kind of throwing away this low-level information about the exact pixels of the image

00:59:29.153 --> 00:59:38.109
and instead is maybe trying to keep around a little bit more semantic information
that's a little bit invariant to small changes in color and texture and things like that.

00:59:38.109 --> 00:59:42.835
So we're building towards a style
transfer here which is really cool.

00:59:42.835 --> 00:59:51.029
So in order to understand style transfer, we also need
to talk about a related problem called texture synthesis.

00:59:51.029 --> 00:59:55.112
So in texture synthesis, this is kind
of an old problem in computer graphics.

00:59:55.112 --> 01:00:05.792
Here the idea is that we're given some input patch of texture. Something like these little scales
here and now we want to build some model and then generate a larger piece of that same texture.

01:00:05.792 --> 01:00:12.056
So for example, we might here want to generate a large
image containing many scales that kind of look like input.

01:00:12.056 --> 01:00:15.986
And this is again a pretty old
problem in computer graphics.

01:00:15.986 --> 01:00:19.720
There are nearest neighbor approaches to
texture synthesis that work pretty well.

01:00:19.720 --> 01:00:21.659
So, there's no neural networks here.

01:00:21.659 --> 01:00:27.792
Instead, this kind of a simple algorithm where we march through
the generated image one pixel at a time in scan line order.

01:00:27.792 --> 01:00:34.742
And then copy... And then look at a neighborhood around the
current pixel based on the pixels that we've already generated

01:00:34.742 --> 01:00:41.934
and now compute a nearest neighbor of that neighborhood in the patches
of the input image and then copy over one pixel from the input image.

01:00:41.934 --> 01:00:48.889
So, maybe you don't need to understand the details here; the idea is just that there
are a lot of classical algorithms for texture synthesis, it's a pretty old problem,

01:00:48.889 --> 01:00:52.749
but you can do this without
neural networks basically.

01:00:52.749 --> 01:00:59.915
And when you run this kind of classical texture synthesis
algorithm, it actually works reasonably well for simple textures.

01:00:59.915 --> 01:01:08.970
But as we move to more complex textures these kinds of simple methods of
maybe copying pixels from the input patch directly tend not to work so well.

01:01:08.970 --> 01:01:16.494
So, in 2015, there was a really cool paper that tried to apply
neural network features to this problem of texture synthesis.

01:01:16.494 --> 01:01:24.753
And ended up framing it as kind of a gradient ascent procedure, kind of similar to
the feature map, the various feature matching objectives that we've seen already.

01:01:24.753 --> 01:01:30.558
So, in order to perform neural texture synthesis
they use this concept of a gram matrix.

01:01:30.558 --> 01:01:36.372
So, what we're going to do, is we're going to take our
input texture and in this case some pictures of rocks

01:01:36.372 --> 01:01:44.347
and then take that input texture and pass it through some convolutional neural
network and pull out convolutional features at some layer of the network.

01:01:44.347 --> 01:01:53.596
So, maybe then this convolutional feature volume that we've talked about,
might be H by W by C or sorry, C by H by W at that layer of the network.

01:01:53.596 --> 01:01:56.515
So, you can think of this
as an H by W spatial grid.

01:01:56.515 --> 01:02:04.347
And at each point of the grid, we have this C dimensional feature
vector describing the rough appearance of that image at that point.

01:02:04.347 --> 01:02:10.179
And now, we're going to use this activation map to
compute a descriptor of the texture of this input image.

01:02:10.179 --> 01:02:15.294
So, what we're going to do is take, pick out two of
these different feature columns in the input volume.

01:02:15.294 --> 01:02:18.318
Each of these feature columns
will be a C dimensional vector.

01:02:18.318 --> 01:02:23.390
And now take the outer product between those
two vectors to give us a C by C matrix.

01:02:23.390 --> 01:02:30.333
This C by C matrix now tells us something about the co-occurrence
of the different features at those two points in the image.

01:02:30.333 --> 01:02:40.218
Right, so if element (i, j) in the C by C matrix is large, that means
both elements i and j of those two input vectors were large, or something like that.

01:02:40.218 --> 01:02:51.572
So, this somehow captures some second order statistics about which features in that
feature map tend to activate together at different spatial positions.

01:02:51.572 --> 01:03:01.664
And now we're going to repeat this procedure using all different pairs of feature vectors from all
different points in this H by W grid. Average them all out, and that gives us our C by C gram matrix.

01:03:01.664 --> 01:03:06.323
And this is then used as a descriptor to describe
kind of the texture of that input image.

01:03:06.323 --> 01:03:13.623
So, what's interesting about this gram matrix is that it has now
thrown away all spatial information that was in this feature volume.

01:03:13.623 --> 01:03:17.545
Because we've averaged over all pairs of
feature vectors at every point in the image.

01:03:17.545 --> 01:03:21.863
Instead, it's just capturing the second order
co-occurrence statistics between features.

01:03:21.863 --> 01:03:25.364
And this ends up being a
nice descriptor for texture.

01:03:25.364 --> 01:03:27.640
And by the way, this is
really efficient to compute.

01:03:27.640 --> 01:03:39.682
So, if you have a C by H by W three-dimensional tensor, you can just reshape it to C by (H times
W), take that times its own transpose, and compute this all in one shot, so it's super efficient.
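
In code, that reshape-and-multiply trick for the gram matrix might look like this minimal sketch for a single C by H by W feature map:

    import torch

    # Gram-matrix sketch: C x (H*W) features times their own transpose,
    # averaging co-occurrences over all spatial positions.
    def gram_matrix(feats):
        C, H, W = feats.shape
        f = feats.reshape(C, H * W)
        return (f @ f.t()) / (H * W)   # C x C second-order feature statistics
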

01:03:39.682 --> 01:03:45.417
But you might be wondering why you don't use an actual covariance
matrix or something like that instead of this funny gram matrix

01:03:45.417 --> 01:03:51.845
and the answer is that using covariance... Using true covariance
matrices also works but it's a little bit more expensive to compute.

01:03:51.845 --> 01:03:55.203
So, in practice a lot of people
just use this gram matrix descriptor.

01:03:55.203 --> 01:04:06.916
Now, once we have this sort of neural descriptor of texture, we use a similar type of
gradient ascent procedure to synthesize a new image that matches the texture of the original image.

01:04:06.916 --> 01:04:10.913
So, now this looks kind of like the feature
reconstruction that we saw a few slides ago.

01:04:10.913 --> 01:04:20.883
But instead of trying to reconstruct the whole feature map from the input image, we're
just going to try and reconstruct this gram matrix texture descriptor of the input image instead.

01:04:20.883 --> 01:04:25.969
So, in practice what this looks like is that well... You'll
download some pretrained model, like in feature inversion.

01:04:25.969 --> 01:04:28.720
Often, people will use
the VGG networks for this.

01:04:28.720 --> 01:04:38.553
You'll take your texture image, feed it through the VGG
network, and compute the gram matrix at many different layers of this network.

01:04:38.553 --> 01:04:47.414
Then you'll initialize your new image from some random initialization and then it
looks like gradient ascent again. Just like for these other methods that we've seen.

01:04:47.414 --> 01:04:52.530
So, you take that image, pass it through the same VGG
network, Compute the gram matrix at various layers

01:04:52.530 --> 01:05:00.833
and now compute loss as the L2 norm between the gram
matrices of your input texture and your generated image.

01:05:00.833 --> 01:05:06.025
And then you back prop and compute the
gradient on the pixels of your generated image.

01:05:06.025 --> 01:05:09.273
And then make a gradient ascent step to
update the pixels of the image a little bit.

01:05:09.273 --> 01:05:17.071
And now, repeat this process many times: go forward, compute your gram
matrices, compute your losses, back prop the gradient to the image, and repeat.
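
A rough sketch of that texture synthesis loop might look like the following, where layer_feats is an assumed helper returning a list of C by H by W feature maps for an image and gram_matrix is as sketched above:

    import torch

    # Neural texture-synthesis sketch: match gram matrices at several layers.
    target_grams = [gram_matrix(f).detach() for f in layer_feats(texture_img)]

    x = torch.rand_like(texture_img).requires_grad_(True)
    opt = torch.optim.Adam([x], lr=0.05)
    for step in range(500):
        opt.zero_grad()
        loss = sum((gram_matrix(f) - g).pow(2).sum()
                   for f, g in zip(layer_feats(x), target_grams))
        loss.backward()
        opt.step()
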

01:05:17.071 --> 01:05:22.702
And once you do this, eventually you'll end up generating
a texture that matches your input texture quite nicely.

01:05:22.702 --> 01:05:30.022
So, this was all from a NIPS 2015 paper by a group in Germany,
and they had some really cool results for texture synthesis.

01:05:30.022 --> 01:05:33.531
So, here on the top, we're showing
four different input textures.

01:05:33.531 --> 01:05:41.133
And now, on the bottom, we're showing the results of doing
this texture synthesis approach by gram matrix matching,

01:05:41.133 --> 01:05:45.681
computing the gram matrix at different
layers of this pretrained convolutional network.

01:05:45.681 --> 01:05:56.965
So, you can see that if we use these very low layers in the convolutional network, then we generally
get splotches of the right colors, but the overall spatial structure doesn't get preserved so much.

01:05:56.965 --> 01:06:06.935
And now, as we move down further in the figure and compute these gram matrices
at higher layers, you see that they tend to reconstruct larger patterns from the input image.

01:06:06.935 --> 01:06:10.107
For example, these whole rocks
or these whole cranberries.

01:06:10.107 --> 01:06:17.677
And this works pretty well, in that now we can synthesize these new
images that kind of match the general spatial statistics of the inputs.

01:06:17.677 --> 01:06:21.445
But they are quite different pixel wise
from the actual input itself.

01:06:21.445 --> 01:06:22.528
Question?

01:06:28.481 --> 01:06:30.847
So, the question is, where
do we compute the loss?

01:06:30.847 --> 01:06:40.285
And in practice, to get good results, typically people will compute gram matrices at many
different layers, and then the final loss will be a sum of all of those, potentially a weighted sum.

01:06:40.285 --> 01:06:47.940
But I think for this visualization, to try to pin point the effect of the
different layers I think these were doing reconstruction from just one layer.

01:06:47.940 --> 01:06:52.999
Then they had a really brilliant
idea kind of after this paper,

01:06:52.999 --> 01:07:01.417
which is, what if we do this texture synthesis approach but instead of using an
image like rocks or cranberries what if we set it equal to a piece of artwork.

01:07:01.417 --> 01:07:03.748
So then, for example, if you...

01:07:03.748 --> 01:07:10.333
If you do the same texture synthesis algorithm by matching
gram matrices, but now we take, for example,

01:07:10.333 --> 01:07:14.656
Vincent van Gogh's Starry Night
or The Muse by Picasso as our texture...

01:07:14.656 --> 01:07:19.759
As our input texture, and then run
this same texture synthesis algorithm.

01:07:19.759 --> 01:07:25.683
Then we can see our generated images tend to reconstruct
interesting pieces from those pieces of artwork.

01:07:25.683 --> 01:07:34.616
And now, something really interesting happens when you combine this idea of texture
synthesis by gram matrix matching with feature inversion by feature matching.

01:07:34.616 --> 01:07:38.988
And then this brings us to this really
cool algorithm called style transfer.

01:07:38.988 --> 01:07:42.716
So, in style transfer, we're
going to take two images as input.

01:07:42.716 --> 01:07:49.813
One, we're going to take a content image that will guide like what
type of thing we want. What we generally want our output to look like.

01:07:49.813 --> 01:07:55.499
Also, a style image that will tell us what is the general
texture or style that we want our generated image to have

01:07:55.499 --> 01:08:02.596
and then we will jointly do feature recon... We will generate a new image
by minimizing the feature reconstruction loss of the content image

01:08:02.596 --> 01:08:05.661
and the gram matrix
loss of the style image.

01:08:05.661 --> 01:08:14.353
And when we do these two things, we get a really cool image that kind of
renders the content image in the artistic style of the style image.

01:08:14.353 --> 01:08:18.317
And now this is really cool. And you can
get these really beautiful figures.

01:08:18.317 --> 01:08:26.384
So again, what this kind of looks like is that you'll take your style image and your
content image pass them into your network to compute your gram matrices and your features.

01:08:26.384 --> 01:08:29.332
Now, you'll initialize your output image
with some random noise.

01:08:29.332 --> 01:08:38.264
Go forward, compute your losses go backward, compute your gradients on the image and repeat
this process over and over doing gradient ascent on the pixels of your generated image.
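
Combining the two losses, a minimal style-transfer sketch might look like this, where layer_feats, gram_matrix, content_img, style_img, the choice of content layer, and the loss weights are all assumptions:

    import torch

    # Style-transfer sketch: feature-reconstruction (content) loss + gram-matrix (style) loss.
    content_feats = [f.detach() for f in layer_feats(content_img)]
    style_grams = [gram_matrix(f).detach() for f in layer_feats(style_img)]

    x = torch.rand_like(content_img).requires_grad_(True)
    opt = torch.optim.Adam([x], lr=0.05)
    for step in range(300):
        opt.zero_grad()
        feats = layer_feats(x)
        content_loss = (feats[2] - content_feats[2]).pow(2).sum()   # match one assumed content layer
        style_loss = sum((gram_matrix(f) - g).pow(2).sum()
                         for f, g in zip(feats, style_grams))
        (content_loss + 1e3 * style_loss).backward()                # weighting trades content vs. style
        opt.step()
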

01:08:38.265 --> 01:08:43.247
And after a few hundred iterations,
generally you'll get a beautiful image.

01:08:43.247 --> 01:08:48.965
So, I have an implementation of this online on my GitHub
that a lot of people are using. And it's really cool.

01:08:48.965 --> 01:08:54.609
So, you can, this is kind of... Gives you a lot more
control over the generated image as compared to DeepDream.

01:08:54.609 --> 01:09:00.544
Right, so in DeepDream, you don't have a lot of control about exactly
what types of things are going to happen coming out at the end.

01:09:00.544 --> 01:09:06.500
You just kind of pick different layers of the networks maybe set
different numbers of iterations and then dog slugs pop up everywhere.

01:09:06.500 --> 01:09:11.228
But with style transfer, you get a lot more fine grain
control over what you want the result to look like.

01:09:11.228 --> 01:09:19.099
Right, now by picking different style images with the same content image,
you can generate whole different types of results, which is really cool.

01:09:19.099 --> 01:09:30.349
Also, you can play around with the hyper parameters here. Right, because we're doing a joint reconstruct... We're minimizing
this feature reconstruction loss of the content image. And this gram matrix reconstruction loss of the style image.

01:09:30.350 --> 01:09:39.468
If you trade off the weighting between those two terms in the loss, then you can get
control over how much we want to match the content versus how much we want to match the style.

01:09:39.469 --> 01:09:41.647
There's a lot of other hyper
parameters you can play with.

01:09:41.647 --> 01:09:45.707
For example, if you resize the style image
before you compute the gram matrix

01:09:45.707 --> 01:09:52.344
that can give you some control over what the scale of features
are that you want to reconstruct from the style image.

01:09:52.344 --> 01:09:58.976
So, you can see that here, we've done this same reconstruction the only
difference is how big was the style image before we computed the gram matrix.

01:09:58.976 --> 01:10:04.263
And this gives you another axis over
which you can control these things.

01:10:04.263 --> 01:10:07.670
You can also actually do style transfer
with multiple style images

01:10:07.670 --> 01:10:13.431
if you just match sort of multiple gram matrices at
the same time. And that's kind of a cool result.

01:10:13.431 --> 01:10:25.105
So, another cool thing you can do: we talked about this multi-scale processing for DeepDream
and saw how multi-scale processing in DeepDream can give you some really cool high-resolution results.

01:10:25.105 --> 01:10:29.330
And you can do a similar type of multi-scale
processing in style transfer as well.

01:10:29.330 --> 01:10:40.867
So, then we can compute images like this at super high resolution. This is, I
think, a 4K image of our favorite school, rendered in the style of Starry Night.

01:10:40.867 --> 01:10:42.652
But this is actually super
expensive to compute.

01:10:42.652 --> 01:10:47.074
I think this one took four GPU's.
So, a little expensive.

01:10:47.074 --> 01:10:53.666
We can also use other style images and get some really cool
results from the same content image. Again, at high resolution.

01:10:53.666 --> 01:11:01.168
Another fun thing you can do is you know, you can actually
do joint style transfer and DeepDream at the same time.

01:11:01.168 --> 01:11:09.017
So, now we'll have three losses, the content loss the style loss and
this... And this DeepDream loss that tries to maximize the norm.

01:11:09.017 --> 01:11:14.286
And get something like this. So, now it's Van
Gogh with the dog slug's coming out everywhere.

01:11:14.286 --> 01:11:15.858
[laughing]

01:11:15.858 --> 01:11:18.466
So, that's really cool.

01:11:18.466 --> 01:11:23.012
But there's kind of a problem with these style transfer
algorithms, which is that they are pretty slow.

01:11:23.012 --> 01:11:30.164
Right, you need to compute a lot of forward and backward passes
through your pretrained network in order to create these images.

01:11:30.164 --> 01:11:38.200
And especially for these high resolution results that we saw in the previous slide. Each
forward and backward pass of a 4k image is going to take a lot of compute and a lot of memory.

01:11:38.200 --> 01:11:46.340
And if you need to do several hundred of those iterations generating these
images could take many, like tons of minutes even on a powerful GPU.

01:11:46.340 --> 01:11:50.320
So, it's really not so practical
to apply these things in practice.

01:11:50.320 --> 01:11:54.874
The solution is to now, train another neural
network to do the style transfer for us.

01:11:54.874 --> 01:12:03.164
So, I had a paper about this last year and the idea is that we're going to fix
some style that we care about at the beginning. In this case, Starry night.

01:12:03.164 --> 01:12:08.034
And now rather than running a separate optimization
procedure for each image that we want to synthesize

01:12:08.034 --> 01:12:15.748
instead we're going to train a single feed forward network that can
input the content image and then directly output the stylized result.

01:12:15.748 --> 01:12:26.848
And now the way that we train this network is that we compute the same content and style losses during training
of our feed forward network and use that same gradient to update the weights of the feed forward network.

01:12:26.848 --> 01:12:36.148
And now this thing takes maybe a few hours to train but once it's trained, then in order to
produce stylized images you just need to do a single forward pass through the trained network.

01:12:36.148 --> 01:12:49.880
So, I have code for this online, and you can see that it ends up being of relatively comparable quality in some
cases to this very slow optimization-based method, but now it runs in real time; it's about a thousand times faster.

01:12:49.880 --> 01:12:54.990
So, here you can see, this is like a
demo of it running live off my webcam.

01:12:54.990 --> 01:13:05.476
So, this is not running live right now obviously, but if you have a big GPU you can easily
run four different styles in real time all simultaneously because it's so efficient.

01:13:05.476 --> 01:13:12.650
There was another group from Russia that had a very similar
paper concurrently, and their results are about as good.

01:13:12.650 --> 01:13:15.392
They also had this kind
of tweak on the algorithm.

01:13:15.392 --> 01:13:25.450
So, this feed forward network that we're training ends up looking a lot like
these... These segmentation models that we saw. So, these segmentation networks,

01:13:25.450 --> 01:13:37.678
for semantic segmentation, we're doing downsampling, then many layers, then some
upsampling with transposed convolutions, in order to downsample and upsample to be more efficient.

01:13:37.678 --> 01:13:45.244
The only difference is that this final layer produces a
three channel output for the RGB of that final image.

01:13:45.244 --> 01:13:48.540
And inside this network, we have batch
normalization in the various layers.

01:13:48.540 --> 01:13:56.027
But in this paper, they swap out the batch normalization for something
else called instance normalization, which tends to give you much better results.

01:13:56.027 --> 01:14:05.500
So, one drawback of these types of methods is that we're now training
one new style transfer network for every style that we want to apply.

01:14:05.500 --> 01:14:10.433
So that could be expensive if now you need to
keep a lot of different trained networks around.

01:14:10.433 --> 01:14:21.178
So, there was a paper from Google that came out pretty recently that addressed this by
using one trained feed-forward network to apply many different styles to the input image.

01:14:21.178 --> 01:14:28.034
So now, they can train one network to apply many
different styles at test time using one trained network.

01:14:28.034 --> 01:14:36.477
So, here it's going to take the content image as input, as well as the identity of the style
you want to apply, and then this is using one network to apply many different types of styles.

01:14:36.477 --> 01:14:39.365
And again, runs in real time.

01:14:39.365 --> 01:14:44.442
That same algorithm can also do this kind of style
blending in real time with one trained network.

01:14:44.442 --> 01:14:52.458
So now, once you trained this network on these four different styles you can actually
specify a blend of these styles to be applied at test time which is really cool.

01:14:52.458 --> 01:15:01.976
So, these kinds of real time style transfer methods are on various
apps and you can see these out in practice a lot now these days.

01:15:01.976 --> 01:15:04.071
So, kind of the summary
of what we've seen today

01:15:04.071 --> 01:15:08.113
is that we've talked about many different
methods for understanding CNN representations.

01:15:08.113 --> 01:15:10.190
We've talked about some of
these activation based methods

01:15:10.190 --> 01:15:14.220
like nearest neighbor, dimensionality
reduction, maximal patches, occlusion images

01:15:14.220 --> 01:15:18.316
to try to understand based on the activation
values of what the features are looking for.

01:15:18.316 --> 01:15:20.461
We also talked about a bunch
of gradient based methods

01:15:20.461 --> 01:15:27.127
where you can use gradients to synthesize new images
to understand your features such as saliency maps

01:15:27.127 --> 01:15:30.417
class visualizations, fooling images,
feature inversion.

01:15:30.417 --> 01:15:37.997
And we also had fun by seeing how a lot of these similar ideas can be applied
to things like Style Transfer and DeepDream to generate really cool images.

01:15:37.997 --> 01:15:40.397
So, next time, we'll talk
about unsupervised learning

01:15:40.397 --> 01:15:45.834
Autoencoders, Variational Autoencoders and generative
adversarial networks so that should be a fun lecture.